{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Evaluation metrics in NLP"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "__author__ = \"Christopher Potts\"\n",
    "__version__ = \"CS224u, Stanford, Fall 2020\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Contents\n",
    "\n",
    "1. [Overview](#Overview)\n",
    "1. [Set-up](#Set-up)\n",
    "1. [Classifier metrics](#Classifier-metrics)\n",
    "  1. [Confusion matrix](#Confusion-matrix)\n",
    "  1. [Accuracy](#Accuracy)\n",
    "  1. [Precision](#Precision)\n",
    "  1. [Recall](#Recall)\n",
    "  1. [F scores](#F-scores)\n",
    "  1. [Macro-averaged F scores](#Macro-averaged-F-scores)\n",
    "  1. [Weighted F scores](#Weighted-F-scores)\n",
    "  1. [Micro-averaged F scores](#Micro-averaged-F-scores)\n",
    "  1. [Precision–recall curves](#Precision–recall-curves)\n",
    "  1. [Average precision](#Average-precision)\n",
    "  1. [Receiver Operating Characteristic (ROC) curve](#Receiver-Operating-Characteristic-(ROC)-curve)\n",
    "1. [Regression metrics](#Regression-metrics)\n",
    "  1. [Mean squared error](#Mean-squared-error)\n",
    "  1. [R-squared scores](#R-squared-scores)\n",
    "  1. [Pearson correlation](#Pearson-correlation)\n",
    "  1. [Spearman rank correlation](#Spearman-rank-correlation)\n",
    "1. [Sequence prediction](#Sequence-prediction)\n",
    "  1. [Word error rate](#Word-error-rate)\n",
    "  1. [BLEU scores](#BLEU-scores)\n",
    "  1. [Perplexity](#Perplexity)\n",
    "1. [Other resources](#Other-resources)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Overview\n",
    "\n",
    "1. Different evaluation metrics __encode different values__ and have __different biases and other weaknesses__. Thus, you should choose your metrics carefully, and motivate those choices when writing up and presenting your work.\n",
    "\n",
    "1. This notebook reviews some of the most prominent evaluation metrics in NLP, seeking not only to define them, but also to articulate what values they encode and what their weaknesses are.\n",
    "\n",
    "1. In your own work, __you shouldn't feel confined to these metrics__. Per item 1 above, you should feel that you have the freedom to motivate new metrics and specific uses of existing metrics, depending on what your goals are.\n",
    "\n",
    "1. If you're working on an established problem, then you'll feel pressure from readers (and referees) to use the metrics that have already been used for the problem. This might be a compelling pressure. However, you should always feel free to argue against those cultural norms and motivate new ones. Areas can stagnate due to poor metrics, so we must be vigilant!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "This notebook discusses prominent metrics in NLP evaluations. I've had to be selective to keep the notebook from growing too long and complex. I think the measures and considerations here are fairly representative of the issues that arise in NLP evaluation.\n",
    "\n",
    "The scikit-learn [model evaluation usage guide](http://scikit-learn.org/stable/modules/model_evaluation.html) is excellent as a source of implementations, definitions, and references for a wide range of metrics for classification, regression, ranking, and clustering.\n",
    "\n",
    "This notebook is the first in a two-part series on evaluation. Part 2 is on [evaluation methods](evaluation_methods.ipynb)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Set-up"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "from nltk.metrics.distance import edit_distance\n",
    "from nltk.translate import bleu_score\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import scipy.stats\n",
    "from sklearn import metrics\n",
    "import utils"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "utils.fix_random_seeds()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Classifier metrics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Confusion matrix\n",
    "\n",
    "A confusion matrix gives a complete picture of how the observed/gold labels compare to the labels predicted by a classifier. In the examples here, rows correspond to gold labels and columns to predicted labels.\n",
    "\n",
    "`ex1 = `\n",
    "<table>\n",
    "<tr>\n",
    "<th></th>\n",
    "<th></th>\n",
    "<th colspan=3 style=\"text-align:center\">predicted</th>\n",
    "</tr>\n",
    "<tr>\n",
    "<th></th>\n",
    "<th></th>\n",
    "<th>pos</th>\n",
    "<th>neg</th>\n",
    "<th>neutral</th>\n",
    "</tr>\n",
    "<tr>\n",
    "<th rowspan=3>gold</th>\n",
    "<th>pos</th>\n",
    "<td>15</td>\n",
    "<td>10</td>\n",
    "<td>100</td>\n",
    "</tr>\n",
    "<tr>\n",
    "<th>neg</th>\n",
    "<td>10</td>\n",
    "<td>15</td>\n",
    "<td>10</td>\n",
    "</tr>\n",
    "<tr>\n",
    "<th>neutral</th>\n",
    "<td>10</td>\n",
    "<td>100</td>\n",
    "<td>1000</td>\n",
    "</tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "For classifiers that predict real values (scores, probabilities), it is important to remember that __a threshold was imposed to create these categorical predictions__.\n",
    "\n",
    "The position of this threshold can have a large impact on the overall assessment that uses the confusion matrix as an input. The default is to choose the class with the highest probability. This is so deeply ingrained that it is often not even mentioned. However, it might be inappropriate:\n",
    "\n",
    "  1. We might care about the full distribution.\n",
    "  1. Where the important class is very small relative to the others, even a modest amount of probability assigned to it might matter.\n",
    "\n",
    "Metrics like [average precision](#Average-precision) vary this threshold as part of their evaluation procedure."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "This function creates the toy confusion matrices that we will use for illustrative examples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "def illustrative_confusion_matrix(data):\n",
    "    classes = ['pos', 'neg', 'neutral']\n",
    "    ex = pd.DataFrame(\n",
    "        data,\n",
    "        columns=classes,\n",
    "        index=classes)\n",
    "    ex.index.name = \"observed\"\n",
    "    return ex"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "ex1 = illustrative_confusion_matrix([\n",
    "    [15,  10,  100],\n",
    "    [10,  15,   10],\n",
    "    [10, 100, 1000]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Accuracy\n",
    "\n",
    "[Accuracy](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score) is the number of correct predictions divided by the total number of predictions. In our confusion matrices, it's the sum of the diagonal values divided by the sum of all the values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def accuracy(cm):\n",
    "    return cm.values.diagonal().sum() / cm.values.sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's an illustrative confusion matrix:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`ex1 = `\n",
    "<table>\n",
    "<tr>\n",
    "<th></th>\n",
    "<th></th>\n",
    "<th colspan=3 style=\"text-align:center\">predicted</th>\n",
    "</tr>\n",
    "<tr>\n",
    "<th></th>\n",
    "<th></th>\n",
    "<th>pos</th>\n",
    "<th>neg</th>\n",
    "<th>neutral</th>\n",
    "</tr>\n",
    "<tr>\n",
    "<th rowspan=3>gold</th>\n",
    "<th>pos</th>\n",
    "<td style=\"background-color: green\">15</td>\n",
    "<td>10</td>\n",
    "<td>100</td>\n",
    "</tr>\n",
    "<tr>\n",
    "<th>neg</th>\n",
    "<td>10</td>\n",
    "<td style=\"background-color: green\">15</td>\n",
    "<td>10</td>\n",
    "</tr>\n",
    "<tr>\n",
    "<th>neutral</th>\n",
    "<td>10</td>\n",
    "<td>100</td>\n",
    "<td style=\"background-color: green\">1000</td>\n",
    "</tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.8110236220472441"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "accuracy(ex1)"
   ]
  },
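  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick check on the value above, the arithmetic can be worked by hand. This is only a sketch using the counts in `ex1`: the diagonal holds the correct predictions, and the denominator is the sum of all nine cells.\n",
    "\n",
    "$$\n",
    "\\textbf{accuracy} = \\frac{15 + 15 + 1000}{1270} = \\frac{1030}{1270} \\approx 0.811\n",
    "$$"
   ]
  },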
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Accuracy bounds\n",
    "\n",
    "[0, 1], with 0 the worst and 1 the best."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Value encoded by accuracy\n",
    "\n",
    "Accuracy seems to directly encode a core value we have for classifiers – how often they are correct. In addition, the accuracy of a classifier on a test set tends to be negatively correlated with the [negative log (logistic, cross-entropy) loss](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html#sklearn.metrics.log_loss), which is a common loss for classifiers. In this sense, classifiers trained with this loss are indirectly optimizing for accuracy."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Weaknesses of accuracy\n",
    "\n",
    "* Accuracy does not give per-class metrics for multi-class problems.\n",
    "\n",
    "* Accuracy fails to control for size imbalances in the classes. For instance, consider the variant of the above in which the classifier guessed only __neutral__:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "ex2 = illustrative_confusion_matrix([\n",
    "    [0, 0,  125],\n",
    "    [0, 0,   35],\n",
    "    [0, 0, 1110]])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pos</th>\n",
       "      <th>neg</th>\n",
       "      <th>neutral</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>observed</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>pos</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>125</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>neg</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>35</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>neutral</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1110</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          pos  neg  neutral\n",
       "observed                   \n",
       "pos         0    0      125\n",
       "neg         0    0       35\n",
       "neutral     0    0     1110"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ex2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Intuitively, this is a worse classifier than the one that produced `ex1`. Whereas the `ex1` classifier gets at least some __pos__ and __neg__ examples right despite their small size, this classifier doesn't even try – it always predicts __neutral__. However, its accuracy is higher!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.8110236220472441\n",
      "0.8740157480314961\n"
     ]
    }
   ],
   "source": [
    "print(accuracy(ex1))\n",
    "print(accuracy(ex2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Related to accuracy\n",
    "\n",
    "* Accuracy tends to be negatively correlated with the [negative log (logistic, cross-entropy) loss](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html#sklearn.metrics.log_loss) that many classifiers optimize. For $N$ examples and $K$ classes, with $y_{i,k}$ the gold label indicator and $p_{i,k}$ the predicted probability, this loss is\n",
    "\n",
    "$$\n",
    "-\\frac{1}{N} \\sum_{i=1}^{N} \\sum_{k=1}^{K} y_{i,k} \\log(p_{i,k})\n",
    "$$\n",
    "\n",
    "* Accuracy can be related in a similar way to [KL divergence](https://en.wikipedia.org/wiki/Kullback–Leibler_divergence):    \n",
    "$$\n",
    "D_{\\text{KL}}(y \\parallel p) = \n",
    "    \\sum _{k=1}^{K} y_{k} \\log\\left(\\frac {y_{k}}{p_{k}}\\right)\n",
    "$$\n",
    "  Where $y$ is a \"one-hot vector\" (a classification label) with $1$ at position $k$, and taking $0 \\log 0 = 0$, this reduces to \n",
    "  $$\n",
    "  \\log\\left(\\frac{1}{p_{k}}\\right) = -\\log(p_{k})\n",
    "  $$\n",
    "  which is the log loss for a single example. Thus, KL divergence extends this loss, and with it the connection to accuracy, to soft labels."
   ]
  },
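  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the one-hot reduction concrete, here is a small hypothetical case. The probabilities are invented for illustration, and the natural log is assumed. Suppose $K = 3$, the gold label is $y = (1, 0, 0)$, and the classifier predicts $p = (0.7, 0.2, 0.1)$. Only the term for the gold class survives:\n",
    "\n",
    "$$\n",
    "D_{\\text{KL}}(y \\parallel p) = 1 \\cdot \\log\\left(\\frac{1}{0.7}\\right) = -\\log(0.7) \\approx 0.357\n",
    "$$\n",
    "\n",
    "A more confident correct prediction, say $p_{1} = 0.9$, would give $-\\log(0.9) \\approx 0.105$, so the loss shrinks as probability mass moves to the gold class."
   ]
  },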
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Precision\n",
    "\n",
    "[Precision](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score) is the number of correct predictions for a class divided by the total number of times that class was guessed. This is a per-class notion; in our confusion matrices, it's the diagonal values divided by the column sums:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "def precision(cm):\n",
    "    return cm.values.diagonal() / cm.sum(axis=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`ex1 =`\n",
    "<table>\n",
    "<tr>\n",
    "<th></th>\n",
    "<th></th>\n",
    "<th colspan=3 style=\"text-align:center\">predicted</th>\n",
    "</tr>\n",
    "<tr>\n",
    "<th></th>\n",
    "<th></th>\n",
    "<th>pos</th>\n",
    "<th>neg</th>\n",
    "<th>neutral</th>\n",
    "</tr>\n",
    "<tr>\n",
    "<th rowspan=3>gold</th>\n",
    "<th>pos</th>\n",
    "<td style=\"background-color: #ADD8E6; font-weight: bold\">15</td>\n",
    "<td style=\"background-color: #00FFAA\">10</td>\n",
    "<td style=\"background-color: #FFC686\">100</td>\n",
    "</tr>\n",
    "<tr>\n",
    "<th>neg</th>\n",
    "<td style=\"background-color: #ADD8E6\">10</td>\n",
    "<td style=\"background-color: #00FFAA; font-weight: bold\">15</td>\n",
    "<td style=\"background-color: #FFC686\">10</td>\n",
    "</tr>\n",
    "<tr>\n",
    "<th>neutral</th>\n",
    "<td style=\"background-color: #ADD8E6\">10</td>\n",
    "<td style=\"background-color: #00FFAA\">100</td>\n",
    "<td style=\"background-color: #FFC686; font-weight: bold\">1000</td>\n",
    "</tr>\n",
    "<tr>\n",
    "<th></th>\n",
    "<th>precision</th>\n",
    "<td>0.43</td>\n",
    "<td>0.12</td>\n",
    "<td>0.90</td>\n",
    "</tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pos        0.428571\n",
       "neg        0.120000\n",
       "neutral    0.900901\n",
       "dtype: float64"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "precision(ex1)"
   ]
  },
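  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Worked by hand from the columns of `ex1`, as a check on the output above (each diagonal count divided by its column sum):\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\textbf{precision}_{\\text{pos}} &= \\frac{15}{15 + 10 + 10} = \\frac{15}{35} \\approx 0.43 \\\\\n",
    "\\textbf{precision}_{\\text{neg}} &= \\frac{15}{10 + 15 + 100} = \\frac{15}{125} = 0.12 \\\\\n",
    "\\textbf{precision}_{\\text{neutral}} &= \\frac{1000}{100 + 10 + 1000} = \\frac{1000}{1110} \\approx 0.90\n",
    "\\end{aligned}\n",
    "$$"
   ]
  },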
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "For our problematic __all neutral__ classifier above, precision is strictly speaking undefined for __pos__ and __neg__:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pos</th>\n",
       "      <th>neg</th>\n",
       "      <th>neutral</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>observed</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>pos</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>125</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>neg</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>35</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>neutral</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1110</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          pos  neg  neutral\n",
       "observed                   \n",
       "pos         0    0      125\n",
       "neg         0    0       35\n",
       "neutral     0    0     1110"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ex2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pos             NaN\n",
       "neg             NaN\n",
       "neutral    0.874016\n",
       "dtype: float64"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "precision(ex2)"
   ]
  },
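  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's common to see these `NaN` values mapped to 0. By default, scikit-learn's `precision_score` does this and issues a warning; its `zero_division` argument controls the behavior."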
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Precision bounds\n",
    "\n",
    "[0, 1], with 0 the worst and 1 the best. (Caveat: undefined values resulting from dividing by 0 need to be mapped to 0.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Value encoded by precision\n",
    "\n",
    "Precision encodes a _conservative_ value in penalizing incorrect guesses."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Weaknesses of precision\n",
    "\n",
    "Precision's dangerous edge case is that one can achieve very high precision for a category by rarely guessing it. Consider, for example, the following classifier's flawless predictions for __pos__ and __neg__. These predictions are at the expense of __neutral__, but that is such a big class that it hardly matters to the precision for that class either."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "ex3 = illustrative_confusion_matrix([\n",
    "    [1, 0,  124],\n",
    "    [0, 1,   24],\n",
    "    [0, 0, 1110]])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pos</th>\n",
       "      <th>neg</th>\n",
       "      <th>neutral</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>observed</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>pos</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>124</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>neg</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>24</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>neutral</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1110</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          pos  neg  neutral\n",
       "observed                   \n",
       "pos         1    0      124\n",
       "neg         0    1       24\n",
       "neutral     0    0     1110"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ex3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pos        1.000000\n",
       "neg        1.000000\n",
       "neutral    0.882353\n",
       "dtype: float64"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "precision(ex3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These numbers mask the fact that this is a very poor classifier!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Compare with `ex1`, whose predictions are spread more evenly across the classes. In exchange for \"perfect\" precision on __pos__ and __neg__, `ex3` incurred only a small drop in precision for __neutral__ (0.88, versus 0.90 for `ex1`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pos</th>\n",
       "      <th>neg</th>\n",
       "      <th>neutral</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>observed</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>pos</th>\n",
       "      <td>15</td>\n",
       "      <td>10</td>\n",
       "      <td>100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>neg</th>\n",
       "      <td>10</td>\n",
       "      <td>15</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>neutral</th>\n",
       "      <td>10</td>\n",
       "      <td>100</td>\n",
       "      <td>1000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          pos  neg  neutral\n",
       "observed                   \n",
       "pos        15   10      100\n",
       "neg        10   15       10\n",
       "neutral    10  100     1000"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ex1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pos        0.428571\n",
       "neg        0.120000\n",
       "neutral    0.900901\n",
       "dtype: float64"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "precision(ex1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Recall\n",
    "\n",
    "[Recall](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score) is the number of correct predictions for a class divided by the number of true instances of that class. This is a per-class notion; in our confusion matrices, it's the diagonal values divided by the row sums. Recall is sometimes called the \"true positive rate\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "def recall(cm):\n",
    "    return cm.values.diagonal() / cm.sum(axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`ex1 =`\n",
    "<table>\n",
    "<tr>\n",
    "<th></th>\n",
    "<th></th>\n",
    "<th colspan=3 style=\"text-align:center\">predicted</th>\n",
    "<th></th>\n",
    "</tr>\n",
    "<tr>\n",
    "<th></th>\n",
    "<th></th>\n",
    "<th>pos</th>\n",
    "<th>neg</th>\n",
    "<th>neutral</th>\n",
    "<th>recall</th>    \n",
    "</tr>\n",
    "<tr>\n",
    "<th rowspan=3>gold</th>\n",
    "<th>pos</th>\n",
    "<td style=\"background-color: #ADD8E6; font-weight: bold\">15</td>\n",
    "<td style=\"background-color: #ADD8E6\">10</td>\n",
    "<td style=\"background-color: #ADD8E6\">100</td>\n",
    "<td>0.12</td>    \n",
    "</tr>\n",
    "<tr>\n",
    "<th>neg</th>\n",
    "<td style=\"background-color: #00FFAA\">10</td>\n",
    "<td style=\"background-color: #00FFAA; font-weight: bold\">15</td>\n",
    "<td style=\"background-color: #00FFAA\">10</td>\n",
    "<td>0.43</td>    \n",
    "</tr>\n",
    "<tr>\n",
    "<th>neutral</th>\n",
    "<td style=\"background-color: #FFC686\">10</td>\n",
    "<td style=\"background-color: #FFC686\">100</td>\n",
    "<td style=\"background-color: #FFC686; font-weight: bold\">1000</td>\n",
    "<td>0.90</td>    \n",
    "</tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "observed\n",
       "pos        0.120000\n",
       "neg        0.428571\n",
       "neutral    0.900901\n",
       "dtype: float64"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "recall(ex1)"
   ]
  },
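  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same check can be done for recall by hand, this time reading along the rows of `ex1` (each diagonal count divided by its row sum):\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\textbf{recall}_{\\text{pos}} &= \\frac{15}{15 + 10 + 100} = \\frac{15}{125} = 0.12 \\\\\n",
    "\\textbf{recall}_{\\text{neg}} &= \\frac{15}{10 + 15 + 10} = \\frac{15}{35} \\approx 0.43 \\\\\n",
    "\\textbf{recall}_{\\text{neutral}} &= \\frac{1000}{10 + 100 + 1000} = \\frac{1000}{1110} \\approx 0.90\n",
    "\\end{aligned}\n",
    "$$"
   ]
  },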
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Recall trades off against precision. For instance, consider again `ex3`, in which the classifier was very conservative with __pos__ and __neg__:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`ex3 =`\n",
    "<table>\n",
    "<tr>\n",
    "<th></th>\n",
    "<th></th>\n",
    "<th colspan=3 style=\"text-align:center\">predicted</th>\n",
    "<th></th>\n",
    "</tr>\n",
    "<tr>\n",
    "<th></th>\n",
    "<th></th>\n",
    "<th>pos</th>\n",
    "<th>neg</th>\n",
    "<th>neutral</th>\n",
    "<th>recall</th>    \n",
    "</tr>\n",
    "<tr>\n",
    "<th rowspan=3>gold</th>\n",
    "<th>pos</th>\n",
    "<td style=\"background-color: #CCCCCC; font-weight: bold\">1</td>\n",
    "<td style=\"background-color: #CCCCCC\">0</td>\n",
    "<td style=\"background-color: #CCCCCC\">124</td>\n",
    "<td>0.008</td>    \n",
    "</tr>\n",
    "<tr>\n",
    "<th>neg</th>\n",
    "<td style=\"background-color: #CCCCCC\">0</td>\n",
    "<td style=\"background-color: #CCCCCC; font-weight: bold\">1</td>\n",
    "<td style=\"background-color: #CCCCCC\">24</td>\n",
    "<td>0.040</td>    \n",
    "</tr>\n",
    "<tr>\n",
    "<th>neutral</th>\n",
    "<td style=\"background-color: #CCCCCC\">0</td>\n",
    "<td style=\"background-color: #CCCCCC\">0</td>\n",
    "<td style=\"background-color: #CCCCCC; font-weight: bold\">1110</td>\n",
    "<td>1.000</td>    \n",
    "</tr>\n",
    "<tr>\n",
    "<th></th>\n",
    "<th>precision</th>\n",
    "<td>1.00</td>\n",
    "<td>1.00</td>\n",
    "<td>0.88</td>\n",
    "</tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Recall bounds\n",
    "\n",
    "[0, 1], with 0 the worst and 1 the best."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Value encoded by recall\n",
    "\n",
    "Recall encodes a _permissive_ value in penalizing only missed true cases."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Weaknesses of recall\n",
    "\n",
    "Recall's dangerous edge case is that one can achieve very high recall for a category by always guessing it. This could mean a lot of incorrect guesses, but recall sees only the correct ones. You can see this in `ex3` above. The model made 148 incorrect __neutral__ predictions, but it missed no true __neutral__ cases, so it achieved perfect recall for that category.\n",
    "\n",
    "`ex3 =`\n",
    "<table>\n",
    "<tr>\n",
    "<th></th>\n",
    "<th></th>\n",
    "<th colspan=3 style=\"text-align:center\">predicted</th>\n",
    "<th></th>\n",
    "</tr>\n",
    "<tr>\n",
    "<th></th>\n",
    "<th></th>\n",
    "<th>pos</th>\n",
    "<th>neg</th>\n",
    "<th>neutral</th>\n",
    "<th>recall</th>    \n",
    "</tr>\n",
    "<tr>\n",
    "<th rowspan=3>gold</th>\n",
    "<th>pos</th>\n",
    "<td style=\"background-color: #CCCCCC; font-weight: bold\">1</td>\n",
    "<td style=\"background-color: #CCCCCC\">0</td>\n",
    "<td style=\"background-color: #CCCCCC\">124</td>\n",
    "<td>0.008</td>    \n",
    "</tr>\n",
    "<tr>\n",
    "<th>neg</th>\n",
    "<td style=\"background-color: #CCCCCC\">0</td>\n",
    "<td style=\"background-color: #CCCCCC; font-weight: bold\">1</td>\n",
    "<td style=\"background-color: #CCCCCC\">24</td>\n",
    "<td>0.040</td>    \n",
    "</tr>\n",
    "<tr>\n",
    "<th>neutral</th>\n",
    "<td style=\"background-color: #CCCCCC\">0</td>\n",
    "<td style=\"background-color: #CCCCCC\">0</td>\n",
    "<td style=\"background-color: #CCCCCC; font-weight: bold\">1110</td>\n",
    "<td>1.000</td>    \n",
    "</tr>\n",
    "<tr>\n",
    "<th></th>\n",
    "<th>precision</th>\n",
    "<td>1.00</td>\n",
    "<td>1.00</td>\n",
    "<td>0.88</td>\n",
    "</tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### F scores\n",
    "\n",
    "[F scores](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.fbeta_score.html#sklearn.metrics.fbeta_score) combine precision and recall via their harmonic mean, with a value $\\beta$ that can be used to emphasize one or the other. Like precision and recall, this is a per-category notion.\n",
    "\n",
    "$$\n",
    "(\\beta^{2}+1) \\cdot \\frac{\\textbf{precision} \\cdot\n",
    "          \\textbf{recall}}{(\\beta^{2} \\cdot \\textbf{precision}) +\n",
    "          \\textbf{recall}}\n",
    "$$\n",
    "\n",
    "Where $\\beta=1$, we have F1:\n",
    "\n",
    "$$\n",
    "2 \\cdot \\frac{\\textbf{precision} \\cdot \\textbf{recall}}{\\textbf{precision} + \\textbf{recall}}\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "def f_score(cm, beta):\n",
    "    p = precision(cm)\n",
    "    r = recall(cm)\n",
    "    return (beta**2 + 1) * ((p * r) / ((beta**2 * p) + r))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "def f1_score(cm):\n",
    "    return f_score(cm, beta=1.0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pos</th>\n",
       "      <th>neg</th>\n",
       "      <th>neutral</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>observed</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>pos</th>\n",
       "      <td>15</td>\n",
       "      <td>10</td>\n",
       "      <td>100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>neg</th>\n",
       "      <td>10</td>\n",
       "      <td>15</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>neutral</th>\n",
       "      <td>10</td>\n",
       "      <td>100</td>\n",
       "      <td>1000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          pos  neg  neutral\n",
       "observed                   \n",
       "pos        15   10      100\n",
       "neg        10   15       10\n",
       "neutral    10  100     1000"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ex1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pos        0.187500\n",
       "neg        0.187500\n",
       "neutral    0.900901\n",
       "dtype: float64"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f1_score(ex1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pos</th>\n",
       "      <th>neg</th>\n",
       "      <th>neutral</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>observed</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>pos</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>125</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>neg</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>35</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>neutral</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1110</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          pos  neg  neutral\n",
       "observed                   \n",
       "pos         0    0      125\n",
       "neg         0    0       35\n",
       "neutral     0    0     1110"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ex2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pos             NaN\n",
       "neg             NaN\n",
       "neutral    0.932773\n",
       "dtype: float64"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f1_score(ex2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pos</th>\n",
       "      <th>neg</th>\n",
       "      <th>neutral</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>observed</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>pos</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>124</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>neg</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>24</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>neutral</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1110</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          pos  neg  neutral\n",
       "observed                   \n",
       "pos         1    0      124\n",
       "neg         0    1       24\n",
       "neutral     0    0     1110"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ex3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pos        0.015873\n",
       "neg        0.076923\n",
       "neutral    0.937500\n",
       "dtype: float64"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f1_score(ex3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Bounds of F scores\n",
    "\n",
    "[0, 1], with 0 the worst and 1 the best, and guaranteed to be between precision and recall."
   ]
  },
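  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of these bounds, assuming made-up values for precision (0.9) and recall (0.3). The function `f_beta_demo` is a hypothetical scalar version of the formula above, named so that it does not overwrite `f_score`. As $\\beta$ grows, the score should move from precision toward recall while staying between the two:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def f_beta_demo(p, r, beta):\n",
    "    # Same formula as `f_score` above, applied to scalar precision and recall.\n",
    "    return (beta**2 + 1) * (p * r) / ((beta**2 * p) + r)\n",
    "\n",
    "# Hypothetical values: high precision, low recall.\n",
    "p_demo, r_demo = 0.9, 0.3\n",
    "\n",
    "for beta_demo in (0.5, 1.0, 2.0):\n",
    "    score_demo = f_beta_demo(p_demo, r_demo, beta_demo)\n",
    "    # Every score should fall between recall and precision.\n",
    "    assert r_demo <= score_demo <= p_demo\n",
    "    print(f'beta={beta_demo}: F={score_demo:.3f}')"
   ]
  },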
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Value encoded by F scores\n",
    "\n",
    "The F$_{\\beta}$ score for a class $K$ is an attempt to summarize how well the classifier's $K$ predictions align with the true instances of $K$. Alignment brings in both missed cases and incorrect predictions. Intuitively, precision and recall keep each other in check in the calculation. This idea runs through almost all robust classification metrics."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Weaknesses of F scores\n",
    "\n",
    "* There is no normalization for the size of the dataset within $K$ or outside of it.\n",
    "\n",
    "* For a given category $K$, the F$_{\\beta}$ score for $K$ ignores  all the values that are off the row and column for $K$, which might be the majority of the data. This means that the individual scores for a category can be very misleading about the overall performance of the system. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`ex1 = `\n",
    "<table display=\"inline\">\n",
    "<tr>\n",
    "<th></th>\n",
    "<th></th>\n",
    "<th colspan=3 style=\"text-align:center\">predicted</th>\n",
    "<th></th>      \n",
    "</tr>\n",
    "<tr>\n",
    "<th></th>\n",
    "<th></th>\n",
    "<th>pos</th>\n",
    "<th>neg</th>\n",
    "<th>neutral</th>\n",
    "<th>F1</th>    \n",
    "</tr>\n",
    "<tr>\n",
    "<th rowspan=3>gold</th>\n",
    "<th>pos</th>\n",
    "<td>15</td>\n",
    "<td>10</td>\n",
    "<td>100</td>\n",
    "<td>0.187</td>      \n",
    "</tr>\n",
    "<tr>\n",
    "<th>neg</th>\n",
    "<td>10</td>\n",
    "<td>15</td>\n",
    "<td>10</td>\n",
    "<td>0.187</td>     \n",
    "</tr>\n",
    "<tr>\n",
    "<th>neutral</th>\n",
    "<td>10</td>\n",
    "<td>100</td>\n",
    "<td style=\"background-color: #D050D0\">1,000</td>\n",
    "<td>0.90</td>     \n",
    "</tr>\n",
    "</table>\n",
    "\n",
    "\n",
    "`ex4 =`\n",
    "<table display=\"inline\">\n",
    "<tr>\n",
    "<th></th>\n",
    "<th></th>\n",
    "<th colspan=3 style=\"text-align:center\">predicted</th>\n",
    "<th></th>      \n",
    "</tr>\n",
    "<tr>\n",
    "<th></th>\n",
    "<th></th>\n",
    "<th>pos</th>\n",
    "<th>neg</th>\n",
    "<th>neutral</th>\n",
    "<th>F1</th>    \n",
    "</tr>\n",
    "<tr>\n",
    "<th rowspan=3>gold</th>\n",
    "<th>pos</th>\n",
    "<td>15</td>\n",
    "<td>10</td>\n",
    "<td>100</td>\n",
    "<td>0.187</td>      \n",
    "</tr>\n",
    "<tr>\n",
    "<th>neg</th>\n",
    "<td>10</td>\n",
    "<td>15</td>\n",
    "<td>10</td>\n",
    "<td>0.187</td>     \n",
    "</tr>\n",
    "<tr>\n",
    "<th>neutral</th>\n",
    "<td>10</td>\n",
    "<td>100</td>\n",
    "<td style=\"background-color: #D050D0\">100,000</td>\n",
    "<td>0.999</td>     \n",
    "</tr>\n",
    "</table>"
   ]
  },
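  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The comparison in these tables can be reproduced with a small self-contained sketch. The helper `per_class_f1_demo` and the names `ex1_demo` and `ex4_demo` are hypothetical, chosen so that they do not overwrite anything defined elsewhere in the notebook; the matrices are copied from the tables above. The __pos__ and __neg__ rows should be identical across the two columns, while the __neutral__ row changes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "def per_class_f1_demo(cm):\n",
    "    # Rows are gold labels and columns are predictions, as in this notebook.\n",
    "    tp = pd.Series(cm.values.diagonal(), index=cm.index)\n",
    "    p = tp / cm.sum(axis=0)  # precision: correct / column total\n",
    "    r = tp / cm.sum(axis=1)  # recall: correct / row total\n",
    "    return 2 * p * r / (p + r)\n",
    "\n",
    "labels_demo = ['pos', 'neg', 'neutral']\n",
    "ex1_demo = pd.DataFrame(\n",
    "    [[15, 10, 100],\n",
    "     [10, 15, 10],\n",
    "     [10, 100, 1000]],\n",
    "    index=labels_demo, columns=labels_demo)\n",
    "\n",
    "# ex4 differs from ex1 only in the neutral/neutral cell.\n",
    "ex4_demo = ex1_demo.copy()\n",
    "ex4_demo.loc['neutral', 'neutral'] = 100000\n",
    "\n",
    "f1_comparison_demo = pd.DataFrame({\n",
    "    'ex1': per_class_f1_demo(ex1_demo),\n",
    "    'ex4': per_class_f1_demo(ex4_demo)})\n",
    "f1_comparison_demo"
   ]
  },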
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Related to F scores\n",
    "\n",
    "* Dice similarity for binary vectors is sometimes used to assess how well a model has learned to identify a set of items. In this setting, [it is equivalent to the per-token F1 score](https://brenocon.com/blog/2012/04/f-scores-dice-and-jaccard-set-similarity/).\n",
    "\n",
    "* The intuition behind F scores (balancing precision and recall) runs through many of the metrics discussed below."
   ]
  },
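  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A small numerical check of the Dice/F1 equivalence, using made-up binary vectors. The names `gold_demo` and `pred_demo` are illustrative only, and the Dice coefficient is computed here as twice the overlap divided by the total number of 1s in the two vectors:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Hypothetical binary vectors: 1 marks an item in the set.\n",
    "gold_demo = np.array([1, 1, 0, 1, 0, 0, 1, 0])\n",
    "pred_demo = np.array([1, 0, 0, 1, 1, 1, 1, 0])\n",
    "\n",
    "# Items marked 1 in both vectors (true positives).\n",
    "tp_demo = int(((gold_demo == 1) & (pred_demo == 1)).sum())\n",
    "\n",
    "# Dice: twice the overlap over the summed sizes of the two sets.\n",
    "dice_demo = 2 * tp_demo / (gold_demo.sum() + pred_demo.sum())\n",
    "\n",
    "# F1 for the positive class, from precision and recall.\n",
    "prec_demo = tp_demo / pred_demo.sum()\n",
    "rec_demo = tp_demo / gold_demo.sum()\n",
    "f1_check_demo = 2 * prec_demo * rec_demo / (prec_demo + rec_demo)\n",
    "\n",
    "print(dice_demo, f1_check_demo)"
   ]
  },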
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Macro-averaged F scores\n",
    "\n",
    "The [macro-averaged F$_{\\beta}$ score](http://scikit-learn.org/stable/modules/model_evaluation.html#multiclass-and-multilabel-classification) (macro F$_{\\beta}$) is the mean of the F$_{\\beta}$ score for each category:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "def macro_f_score(cm, beta):\n",
    "    return f_score(cm, beta).mean(skipna=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pos</th>\n",
       "      <th>neg</th>\n",
       "      <th>neutral</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>observed</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>pos</th>\n",
       "      <td>15</td>\n",
       "      <td>10</td>\n",
       "      <td>100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>neg</th>\n",
       "      <td>10</td>\n",
       "      <td>15</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>neutral</th>\n",
       "      <td>10</td>\n",
       "      <td>100</td>\n",
       "      <td>1000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          pos  neg  neutral\n",
       "observed                   \n",
       "pos        15   10      100\n",
       "neg        10   15       10\n",
       "neutral    10  100     1000"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ex1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pos        0.187500\n",
       "neg        0.187500\n",
       "neutral    0.900901\n",
       "dtype: float64"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f1_score(ex1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.42530030030030036"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "macro_f_score(ex1, beta=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pos</th>\n",
       "      <th>neg</th>\n",
       "      <th>neutral</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>observed</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>pos</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>125</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>neg</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>35</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>neutral</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1110</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          pos  neg  neutral\n",
       "observed                   \n",
       "pos         0    0      125\n",
       "neg         0    0       35\n",
       "neutral     0    0     1110"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ex2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pos             NaN\n",
       "neg             NaN\n",
       "neutral    0.932773\n",
       "dtype: float64"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f1_score(ex2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "nan"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "macro_f_score(ex2, beta=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pos</th>\n",
       "      <th>neg</th>\n",
       "      <th>neutral</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>observed</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>pos</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>124</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>neg</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>24</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>neutral</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1110</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          pos  neg  neutral\n",
       "observed                   \n",
       "pos         1    0      124\n",
       "neg         0    1       24\n",
       "neutral     0    0     1110"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ex3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pos        0.015873\n",
       "neg        0.076923\n",
       "neutral    0.937500\n",
       "dtype: float64"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f1_score(ex3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.34343203093203095"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "macro_f_score(ex3, beta=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Bounds of macro-averaged F scores\n",
    "\n",
    "[0, 1], with 0 the worst and 1 the best, and guaranteed to be between precision and recall."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Value encoded by macro-averaged F scores\n",
    "\n",
    "Macro F$_{\\beta}$ scores inherit the values of F$_{\\beta}$ scores, and they additionally say that we care about all the classes equally regardless of their size. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Weaknesses of macro-averaged F scores\n",
    "\n",
    "In NLP, we typically care about modeling all of the classes well, so macro-F$_{\\beta}$ scores often seem appropriate. However, this is also the source of their primary weaknesses:\n",
    "\n",
    "* If a model is doing really well on a small class $K$, its high macro F$_{\\beta}$ score might mask the fact that it mostly makes incorrect predictions outside of $K$. So F$_{\\beta}$ scoring will make this kind of classifier look better than it is.\n",
    "\n",
    "* Conversely, if a model does well on a very large class, its overall performance might be high even if it stumbles on some small classes. So F$_{\\beta}$ scoring will make this kind of classifier look worse than it is, as measured by sheer number of good predictions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Weighted F scores\n",
    "\n",
    "[Weighted F$_{\\beta}$ scores](http://scikit-learn.org/stable/modules/model_evaluation.html#multiclass-and-multilabel-classification) average the per-category F$_{\\beta}$ scores, but it's a weighted average based on the size of the classes in the observed/gold data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "def weighted_f_score(cm, beta):\n",
    "    scores = f_score(cm, beta=beta).values\n",
    "    weights = cm.sum(axis=1)\n",
    "    return np.average(scores, weights=weights)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.828993812624765"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "weighted_f_score(ex3, beta=1.0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Bounds of weighted F scores\n",
    "\n",
    "[0, 1], with 0 the worst and 1 the best, but without a guarantee that it will be between precision and recall."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Value encoded by weighted F scores\n",
    "\n",
    "Weighted F$_{\\beta}$ scores inherit the values of F$_{\\beta}$ scores, and they additionally say that we want to weight the summary by the number of actual and predicted examples in each class. This will probably correspond well with how the classifier will perform, on a per example basis, on data with the same class distribution as the training data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Weaknesses of weighted F scores\n",
    "\n",
    "Large classes will dominate these calculations. Just like macro-averaging, this can make a classifier look artificially good or bad, depending on where its errors tend to occur."
   ]
  },
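  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see how much the large class can dominate, this sketch recomputes the macro-averaged and weighted F1 scores for `ex3` side by side. It is self-contained (plain NumPy, with hypothetical names suffixed `_ex3` to avoid clashing with the notebook's own definitions), and the matrix is copied from `ex3` above. The two values should match the results of `macro_f_score(ex3, beta=1)` and `weighted_f_score(ex3, beta=1.0)` reported earlier:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# ex3 from above: rows are gold labels, columns are predictions.\n",
    "cm_ex3 = np.array(\n",
    "    [[1, 0, 124],\n",
    "     [0, 1, 24],\n",
    "     [0, 0, 1110]])\n",
    "\n",
    "tp_ex3 = np.diag(cm_ex3)\n",
    "prec_ex3 = tp_ex3 / cm_ex3.sum(axis=0)  # correct / column totals\n",
    "rec_ex3 = tp_ex3 / cm_ex3.sum(axis=1)   # correct / row totals\n",
    "f1_ex3 = 2 * prec_ex3 * rec_ex3 / (prec_ex3 + rec_ex3)\n",
    "\n",
    "# Macro: every class counts equally.\n",
    "macro_ex3 = f1_ex3.mean()\n",
    "\n",
    "# Weighted: each class counts in proportion to its gold size.\n",
    "gold_sizes_ex3 = cm_ex3.sum(axis=1)\n",
    "weighted_ex3 = np.average(f1_ex3, weights=gold_sizes_ex3)\n",
    "\n",
    "print(f'macro F1:    {macro_ex3:.3f}')\n",
    "print(f'weighted F1: {weighted_ex3:.3f}')"
   ]
  },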
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Micro-averaged F scores\n",
    "\n",
    "[Micro-averaged F$_{\\beta}$ scores](http://scikit-learn.org/stable/modules/model_evaluation.html#multiclass-and-multilabel-classification) (micro F$_{\\beta}$ scores) add up the 2 $\\times$ 2 confusion matrices for each category versus the rest, and then they calculate the F$_{\\beta}$ scores, with the convention being that the positive class's F$_{\\beta}$ score is reported. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "This function creates the 2 $\\times$ 2 matrix for a category `cat` in a confusion matrix `cm`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "def cat_versus_rest(cm, cat):\n",
    "    yes = cm.loc[cat, cat]\n",
    "    yes_no = cm.loc[cat].sum() - yes\n",
    "    no_yes = cm[cat].sum() - yes\n",
    "    no = cm.values.sum() - yes - yes_no - no_yes\n",
    "    return pd.DataFrame(\n",
    "        [[yes,    yes_no],\n",
    "         [no_yes,    no]],\n",
    "        columns=['yes', 'no'],\n",
    "        index=['yes', 'no'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pos</th>\n",
       "      <th>neg</th>\n",
       "      <th>neutral</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>observed</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>pos</th>\n",
       "      <td>15</td>\n",
       "      <td>10</td>\n",
       "      <td>100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>neg</th>\n",
       "      <td>10</td>\n",
       "      <td>15</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>neutral</th>\n",
       "      <td>10</td>\n",
       "      <td>100</td>\n",
       "      <td>1000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          pos  neg  neutral\n",
       "observed                   \n",
       "pos        15   10      100\n",
       "neg        10   15       10\n",
       "neutral    10  100     1000"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>yes</th>\n",
       "      <th>no</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>yes</th>\n",
       "      <td>15</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>no</th>\n",
       "      <td>20</td>\n",
       "      <td>1125</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     yes    no\n",
       "yes   15   110\n",
       "no    20  1125"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>yes</th>\n",
       "      <th>no</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>yes</th>\n",
       "      <td>15</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>no</th>\n",
       "      <td>110</td>\n",
       "      <td>1125</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     yes    no\n",
       "yes   15    20\n",
       "no   110  1125"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>yes</th>\n",
       "      <th>no</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>yes</th>\n",
       "      <td>1000</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>no</th>\n",
       "      <td>110</td>\n",
       "      <td>50</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      yes   no\n",
       "yes  1000  110\n",
       "no    110   50"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(ex1)\n",
    "display(cat_versus_rest(ex1, 'pos'))\n",
    "display(cat_versus_rest(ex1, 'neg'))\n",
    "display(cat_versus_rest(ex1, 'neutral'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>yes</th>\n",
       "      <th>no</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>yes</th>\n",
       "      <td>1030</td>\n",
       "      <td>240</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>no</th>\n",
       "      <td>240</td>\n",
       "      <td>2300</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      yes    no\n",
       "yes  1030   240\n",
       "no    240  2300"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sum([cat_versus_rest(ex1, cat) for cat in ex1.index])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "For the micro F$_{\\beta}$ score, we add up these per-category confusion matrices, as in the previous cell, and calculate the F$_{\\beta}$ score for the `'yes'` category of the summed matrix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": [
    "def micro_f_score(cm, beta):\n",
    "    c = sum([cat_versus_rest(cm, cat) for cat in cm.index])\n",
    "    return f_score(c, beta=beta).loc['yes']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.8110236220472442"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "micro_f_score(ex1, beta=1.0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Bounds of micro-averaged F scores\n",
    "\n",
    "[0, 1], with 0 the worst and 1 the best, and guaranteed to be between precision and recall."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Value encoded by micro-averaged F scores\n",
    "\n",
    "* Micro F$_{\\beta}$ scores inherit the values of weighted F$_{\\beta}$ scores. (The resulting scores tend to be very similar.)\n",
    "\n",
    "* For two-class problems, the micro-averaged score has an intuitive interpretation in which precision and recall are defined in terms of correct and incorrect guesses, ignoring the class."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Weaknesses of micro-averaged F scores\n",
    "\n",
    "The weaknesses are also the same as those of weighted F$_{\\beta}$ scores, with the additional drawback that we actually get two potentially very different values, one for the `'yes'` category and one for the `'no'` category of the summed matrix, and we have to choose one to meet our goal of having a single summary number. (See the `'yes'` in the final line of `micro_f_score`.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
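   },
   "source": [
    "#### Related to micro-averaged F scores\n",
    "\n",
    "* For single-label problems like `ex1`, the micro-averaged F$_{\\beta}$ score is equivalent to accuracy.\n",
    "\n",
    "* F1 is identical to both precision and recall on the 2 $\\times$ 2 matrix that is the basis for the calculation, because summing the per-category matrices gives equal numbers of false positives and false negatives."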
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Precision–recall curves\n",
    "\n",
    "I noted above that confusion matrices hide a threshold for turning probabilities/scores into predicted labels. With precision–recall curves, we finally address this.\n",
    "\n",
    "A precision–recall curve is a method for summarizing the relationship between precision and recall for a binary classifier. \n",
    "\n",
    "The basis for this calculation is not the confusion matrix, but rather the raw scores or probabilities returned by the classifier. Normally, we use 0.5 as the threshold for saying that a prediction is positive. However, each distinct real value in the set of predictions is a potential threshold. The precision–recall curve explores this space."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Here's a basic implementation; [the sklearn version](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html) is more flexible and so recommended for real experimental frameworks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "def precision_recall_curve(y, probs):\n",
    "    \"\"\"`y` is a list of labels, and `probs` is a list of predicted\n",
    "    probabilities or predicted scores -- likely a column of the\n",
    "    output of `predict_proba` using an `sklearn` classifier.\n",
    "    \"\"\"\n",
    "    thresholds = sorted(set(probs))\n",
    "    data = []\n",
    "    for t in thresholds:\n",
    "        # Use `t` to create labels:\n",
    "        pred = [1 if p >= t else 0 for p in probs]\n",
    "        # Precision/recall analysis as usual, focused on\n",
    "        # the positive class:\n",
    "        cm = pd.DataFrame(metrics.confusion_matrix(y, pred))\n",
    "        prec = precision(cm)[1]\n",
    "        rec = recall(cm)[1]\n",
    "        data.append((t, prec, rec))\n",
    "    # For intuitive graphs, always include this end-point:\n",
    "    data.append((None, 1, 0))\n",
    "    return pd.DataFrame(\n",
    "        data, columns=['threshold', 'precision', 'recall'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "I'll illustrate with a hypothetical binary classification problem involving balanced classes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "y = np.random.choice((0, 1), size=1000, p=(0.5, 0.5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "Suppose our classifier is generally able to distinguish the two classes, but it never predicts a value above 0.4, so our usual method of thresholding at 0.5 would make the classifier look very bad:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred = [np.random.uniform(0.0, 0.3) if x == 0 else np.random.uniform(0.1, 0.4)\n",
    "         for x in y]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "The precision–recall curve can help us identify the optimal threshold given whatever our real-world goals happen to be:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "prc = precision_recall_curve(y, y_pred)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "def plot_precision_recall_curve(prc):\n",
    "    ax1 = prc.plot.scatter(x='recall', y='precision', legend=False)\n",
    "    ax1.set_xlim([0, 1])\n",
    "    ax1.set_ylim([0, 1.1])\n",
    "    ax1.set_ylabel(\"precision\")\n",
    "    ax2 = ax1.twiny()\n",
    "    ax2.set_xticklabels(prc['threshold'].values[::100].round(3))\n",
    "    _ = ax2.set_xlabel(\"threshold\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZAAAAEjCAYAAAAc4VcXAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAgAElEQVR4nO3de5xVdb3/8ddnZpiRuXAbFOV+VQNSUBIxE/NS6CmooxWWmeWJUx5L85z6Wfbo9LM6p/KUmZpG5a/sYYJ2jsWjX3nNC/IThAAVMGTkOiAXh/sAAzN8fn+sNcxmmMuaNXvty8z7+XjwYO+9vmvtz/6ymfes9V3ru8zdERER6aiCbBcgIiL5SQEiIiKxKEBERCQWBYiIiMSiABERkVgUICIiEosCRLoVM+tjZjeGjy82sz8l8B7Xm9m9HVxnvZn1b+H1b5vZv6WvOpH0UYBId9MHuLEjK5hZYUK1iOQ1BYh0N98HRpnZcuBOoNzMfm9mfzezh83M4NgewbfM7CXgY2Y2ysyeMLO/mdl8MzszbPcxM1thZq+a2Ysp7zMwbL/GzH7Y+KKZXWNmr4fr/KClAs3sdjNbbWbPAGck1REinVWU7QJEMuw2YLy7TzCzi4E/AuOALcAC4L3AS2HbQ+5+IYCZPQt8wd3XmNlk4GfAJcC3gA+6+2Yz65PyPhOAiUAdsNrM7gEagB8A5wK7gKfM7CPu/ofGlczsXGBmuG4RsBT4W/q7QaTzFCDS3b3i7tUA4V7JcJoCZG74ejlwAfBYuIMCUBL+vQD4tZk9CvxPynafdfc94fqrgGFAJfC8u+8IX38YuAj4Q8p67wMed/cDYZt5afukImmmAJHuri7lcQPH/5+oDf8uAHa7+4TmK7v7F8I9kn8AlptZY5uWtmvN12+FJqiTvKAxEOlu9gEVHVnB3fcC68zsYwAWODt8PMrdF7n7t4B3gCFtbGoRMNXM+ocD89cALzRr8yLwUTPraWYVwIc7UqtIJmkPRLoVd68xswVmtgI4CGyLuOqngPvN7JtAD2AO8Cpwp5mNIdi7eDZ87YQ9lfC93zazrwPPhe3/7O5/bNZmqZnNBZYDG4D5Hf2MIplims5dRETi0CEsERGJRQEiIiKxKEBERCSWrAeImU0Lr7qtMrPbWlheYmZzw+WLzGx4yrKvh6+vNrMPprz+oJltDwdK80LcfjCz4WZ20MyWh38eSFmn2Mxmm9mb4ZXWV2XuE8UXoS8uMrOlZlZvZlc3W/aEme1uPseVmd0Ubs9bmnMqV3WmL8Llvcxsc+rcXGb2PTPbZGb7k64/neL2hZm9P+X/x3IzO2RmH2m27j351B8R+uJWM1tlZq+Z2bNmNixl2WcsmCFhjZl9poV150X+2enuWfsDFAJvASOBYoIzWMY2a3Mj8ED4eCYwN3w8NmxfAowIt1MYLrsIOAdYkc3Pl6F+GN7a5wT+N/Dd8HEB0D/bnzVNfTEcOAt4CLi62bJLCU59/VOz1yeG663Ph35IR1+Ey+8Gfgfcm/La+cBpwP5sf8ZM9kXYph+wEyhNeW0S8Nt86Y+IffH+xs8IfDHl50U/YG34d9/wcd+U9f4x/L5E+tmZ7T2Q84Aqd1/r7ocJTo2c0azNDOA34ePfA5dacDnwDGCOu9e5+zqgKtwe7v4iwZckX3SmH9ryOeA/Adz9qLu/k8aak9JuX7j7end/DTjafGV3f5bgWo/mry9z9/XJlJyYTvVFOC3KAOCpZussdPe3kys7EZ3qixRXA3/xpiv9CwnmRPtaMmUnIkpfPNf4GYGFwODw8QeBp919p7vvAp4GpsGxGRduBb4btZBsB8ggYFPK8+rwtRbbuHs9sIdgSogo6+aLzvQDwAgzW2ZmL5jZ+yCYtjxc9p1wt/4xMxuQ2CdIn67079pZsfvCzAqAHwFfTaCubEjX92Im8EjK85uAeXkWqB3tixuAv0RY9zsE35kD
RJTtAGnpN+jmF6a01ibKuvmiM/3wNjDU3ScS/PbwOzPrRXCR6GBggbufA7wM/Ff6Sk5MV/p37azO9MWNBBcqbmq3ZX7o9PfCzE4D3g08GT4fCHwMuKfT1WVW5L4ws2sJDtHd2da6FkzBM9rdH+9IIdkOkGqOn/phMMGsqC22MbMioDfB4ako6+aL2P0QHsKrAXD3vxEcGz0dqCH4TaLxC/EYwbhQrutK/66d1Zm+mALcZGbrCX5xuM7Mvp/e8jIqHd+LjxNMVHkkfD4RGA1Uhf1UamZVnS00AyL1hZldBtwOTHf3unbWnQKcG/bDS8DpZvZ8u5VkeTCoiGAQZwRNg0HjmrX5F44fPH40fDyO4wfR1xIOonvTgFq+DKJ3ph9OpunkgZHAZqBf+HwOcEn4+HrgsWx/1nT0RUrbX9PywPHFNBtET1m2nvwZRO90X6T829/bwut5MWicxu/FQuD9bbxHXvRHxJ8XEwl+mRzT7PV+wDqCAfS+4eN+zdpE/tmZC51xJfBm+GFvD1+7gyA1AU4i+O25CngFGJmy7u3hequBK1Jef4Tg0M4RgsS9IdufM6l+AK4CVoZfoqXAh1O2OYxgcr7XCOZpGprtz5mmvnhP+O9aS7CntTJl3fnADoJ5rqoJ7tUB8OXweT3Bb1y/zPbnTLovUrZxPcefhfXDcJ2j4d/fzvbnzMD3YjjBL1cFbWw/LwIkYl88QzDP2/Lwz7yUdT8X/hypAj7bwraHEzFANBeWiIjEku0xEBERyVMKEBERiUUBIiIisShAREQklrwLEDOble0acoX6oon6oon6oon6okkSfZF3AQLoC9FEfdFEfdFEfdFEfdFEASIiIrkh764DMTMvLS3Ndhk5ob6+nqKiomyXkRPUF03UF03UF00OHDjg7p7WnYa869nS0lJqa2uzXYaISF4xs4Pp3qYOYYmISCwKEBERiUUBIiIisShAREQkFgWIiIjEogAREZFYFCAiIhKLAkRERGJRgIiISCwKEBERiUUBIiIisShAREQkFgWIiIjEogAREZFYFCAiIhJLYvcDMbMHgQ8B2919fAvLDbgbuBI4AFzv7kvb227DUee+v75JcVEhh+sbTvi7X1kJADtr61ptk/Tf2awhXz5/v7ISBvUtZdzAXlSWl6TjKyciGZbkDaV+DdwLPNTK8iuAMeGfycD94d9tOlx/lDufWpOmEiXbCg3u+sQEpk8YlO1SRKSDEjuE5e4vAjvbaDIDeMgDC4E+ZnZaUvVIbmpwuGXucmr212W7FBHpoGyOgQwCNqU8rw5fO4GZzTKzJWa2xI82ZKQ4yZyjDi+/VZPtMkSkg7IZINbCa95SQ3ef7e6T3H2SFRQmXJZkw4aa/dkuQUQ6KJsBUg0MSXk+GNiSpVokyzbtOpjtEkSkg5IcRG/PPOAmM5tDMHi+x93fbm+l4qICvvqBMXl/FlJXfO8oNew9WM/P56874d91zuJq3j2oD586f1h6vl0ikjhzb/GoUec3bPYIcDHQH9gG/DvQA8DdHwhP470XmEZwGu9n3X1Je9stKyvz2traRGqW5L26aTdX37+AI0dPXFZo8Mrtl+m0XpEEmNkBdy9L5zYT2wNx92vaWe7AvyT1/pKbBvftiVkBcGKCNDg8vXIrMydrL0QkH+hKdMmoyvIS/n362FaX3/b4Cr71x9czWJGIxKUAkYz71ORhzDi79Ut+Hnp5I1Xb9mWwIhGJQwEiWTFxaN82lz+5cmuGKhGRuBQgkhUXju7f5vI7n3qThxdtyFA1IhKHAkSyYvSACq6bMrTNNrc/voKHFypERHKVAkSy5o4Z7+Y/P3rCRM3H+eYfVmieLJEcpQCRrPrAuFMpbGlSm5AD//TrxQoRkRykAJGsqiwv4a5PTKCgjRBZVr2HSd99hnnLN2euMBFplwJEsm76hEE8dctFbbZx4JY5mvZdJJcoQCQnjB5QwffaGQ85Ctwyd1lmChKRdilAJGd8avIwvnHlmW22mb+m
RhcZiuSIxCZTTIomU+z6Hl64gdv/sKLV5ReOqmTmeUPo1bNY91QXiSiJyRQVIJKTqrbt47K7Xmy3ne6pLhJNXs3GK9IZowdU8MWpI7n/hbVttmtwuHnOcgb2PolJIyqp2V/Hyi172HvwCID2UkQSpD0QyVk1++s497vPRG4/rO9JbNx16IT7Ihvwj+cM4otTRzF6QEVaaxTJF0nsgWgQXXJWZXkJ118Q/d4gG1oIDwhOAf7vpZu57K4XNVW8SBopQCSnfemSMWndnqaKF0kfBYjktMryEn46c0Kb05101EtVO9K3MZFuTIPokvOmTxjEe0f3PzY4/vjyzTz7RvwQeGPz3jRWJ9J9aRBd8lLVtn385Jk3+dPr8W48deaAMh7+/BSdnSXdhq4DQQEix2s8bXfzroPsrK2jX1kJg/qWMm5gL37+wlvMnr+uzfV/OlPXkEj3oABBASLRvfjmDq578JU22xQavHL7ZdoTkS5Pp/GKdMC4gb3aHXxvcFi5ZU9mChLpYhQg0mU13mukqJ1v+ZtbdVqvSBw6hCVdXs3+On45f22r06IYcOvlp/PJyUN1KEu6LI2BoACR+H7y9Gp+8mxVq8sLDK4cfyrXXzCcA0caANM8WtJlaDJFkU440nC0zeVHHf70+tbjTg024OtXnMmsqaMSrk4k/2gMRLqN0uKO/77kwH/85e9c98uFup2uSDMKEOk2Pjju1NjrvlhVw7nffYaHF21IY0Ui+U0BIt3G6AEVXDdlaKe2cfvjK3h4oUJEBDSILt1Q1bZ93P98Ff+9bEus9QuAxd/UxYeSX/LuLCwzmwbcDRQCv3T37zdbPhT4DdAnbHObu/+5rW0qQCRdUqdB+evft/P0G9sjr/vQ597DRaefkmB1IumVVwFiZoXAm8DlQDWwGLjG3VeltJkNLHP3+81sLPBndx/e1nYVIJKU1Nvh3vPXNaze1vr3bMrIvtz7yXO1FyJ5I9+mMjkPqHL3te5+GJgDzGjWxoFe4ePeQLxjCiJpUFlewkWnn8KHzh7Ek1+5mJ98/KxW2768dhfnfvcZ5i3fnMEKRXJLkgEyCNiU8rw6fC3Vt4Frzawa+DPwpZY2ZGazzGyJmS2pr69PolaRE3zknCGMH9j2PdRvnrNcp/dKt5VkgLQ0jV3z42XXAL9298HAlcBvzeyEmtx9trtPcvdJRUW69lEy5wtTR7e53IFb5i7LTDEiOSbJAKkGhqQ8H8yJh6huAB4FcPeXgZOA/gnWJNIhU0ZVtvibUKr5a2p0n3XplpIMkMXAGDMbYWbFwExgXrM2G4FLAczsXQQBohtWS86oLC/h7pkT2g2Rf330VZasq8lITSK5IunTeK8EfkJwiu6D7v49M7sDWOLu88Izr34BlBMcDfiauz/V1jZ1FpZkQ83+Oh5euJ4fP9P6ZIwAp/Uq5trzhzN5RD9NyCg5Ja9O402KAkSy6cLvP0P17o4Nmhca3PUJ3TpXsivfTuMV6XJuu2Jsh9dpcJ2tJV2TAkSkA6IMqrfEgYvv/CuPLNqgIJEuQ4ewRDpo3vLN3Dxn+QnnpHfE1DGVjB/Ui/oG2Lr3ENPPHsilY+PPFizSHo2BoACR3FCzv46X36rhgReqWLElPafw9i4pZOZ5Q6k9XM+4gb35wLhTNfguaaMAQQEiuadq2z6eXLmVV9bt5IU176R127dcOppbLj8jrduU7kkBggJEclvjhIz3PVfFonW70rLNySP6MvefL0jLtqT70llYIjmucULGuf98Ad+44sy0bHPRul26SFFykgJEJCGzpo7ib9+8jOunDOv0tuYs3piGikTSS4ewRDKgcdD9jbf3cPBwAwCvrK9hxZb9kdYf2KuEC0ZXcu6wfhpcl1g0BoICRLqWqm37eKlqB3sOHOGVdTtZsHZnpPUuOeNkLjr9ZC4c3Z/RA9qecl4EFCCAAkS6tut/tZDn13RsvGN4v55MHNqHwgJjcN9SjjQcpbS4iA+OO1XhIsckESC6uYZI
DikuKuzwOut3HmT9zoMnvH7nU28yfmAFl79rAPvqGpg2bgCTRlSmo0wRQHsgIjnlzife4L7n1ya2/X49C/nYpKE0OAqUbkaHsFCASNdWtW0fl931Ysber39ZD6affRrnDKtkyqhKDc53YQoQFCDS9X3rj6/z0MtNp+2OH1jBoSMNVO04kPh7nzmgnC9MHclHzhnSfmPJKwoQFCDSPVRt28fyTbuZMKTPsYHwqm37+N2iDTyxcitb9iQ7o29pD+Ohz03WIa4uRAGCAkQEmubfqt51gNVb97F00x4KgYY0v8/Y08r5881T07xVyQYFCAoQkZbU7K+jetdBBvftybod+/nJM2t46a0aCoCjndx2aRHMmDiYgoICPjphoPZK8pQCBAWISFSNoVJWXMjzq7fz/OrtrKs5wObdhzq13f5lPRgzoJx3D+rDhWNOZmDvk6g93MDgvj01CJ/DFCAoQEQ6q/Hw1/w177BwXbQr36MoMJg4pDc3XjyaCUP7HtsjUqjkBgUIChCRdKrZX8fTK7fyqwXrWLM9vf+vCgAz439NO4NZU0elddvScQoQFCAiSanato/rHlyUyBlep1YU8+kpw/nguFPpW1asvZMsUICgABFJ2ofveZHXN6fnNr2tMcAM/unC4XzjH8Yl+l4SUICgABHJhCXrapizeCN7D9az71A9L6dxrKS5vqWFXD9lBJt2HWRArxI+OnGwJoFMQNYCxMxKgKuA4aRMwOjud6SzmCgUICKZV7O/jseXVjPv1c28lvDeCcDo/qWcP6qSAoy17+ynqKCAkSeXclqfUg7XNzCsslxTr3RQNgPkCWAP8DdSrlVy9x+ls5goFCAi2dV43/fNuw6ys7aO7Xvr+MvKrWzfdzjjtUwYXEG/8hL6l5aw7/ARhvQtY2CfnrpPSguyGSAr3H18Ot84LgWISG5qvDlWSVEhT6/axl9X78hqPZec0Z8HPzv5WF1v7z7E1r2HmH72QC4de2pWa8uGbAbIbOAed389nW8ehwJEJD/U7K/jFy+8xUML13PgSHbGWlu7Er+kEGacPYijwIHDRzhUf5SK4iI27znEmFPKueqcwWzdW8eGmv1d5nBZNgNkFTAaWAfUEZxE4e5+VjqLiUIBIpJ/lqyr4clVWynA+Pn8ddkuJ5ZJQ3tz1uA+HDzSwP5D9RQVGrV19eyuPUID8Onzh+b0LMbZDJBhLb3u7hvSWUwUChCR/DZv+WZunbucBgeHRCaBzJYeBu8dXUnFSUWMOqWCbXsPsWnnQU6uKAZgy+6D9O7Zg1N79aR690F69yxi9CkVx24/nDqnWbr3eLJ6Gq+ZnQ28L3w6391fjbDONOBugu/IL939+y20+TjwbYLv0qvu/sm2tqkAEcl/qfN0Nc6jtav2MA++tJY3t+9jZGU5xT0KKMB4ctVWtmVhgD7TBvcuYfOeOsyC5+NOq6BHodG/vISxA3vTo9DY+M4BVm7byyllJRysb6BnYSHbD9Qxsl8ZvUp7UICxo/bQsZMJxg/sRdX2/azYsofvXTXxsB9tSGsqRd0DuRn4PPA/4UsfBWa7+z1trFMIvAlcDlQDi4Fr3H1VSpsxwKPAJe6+y8xOcfftbdWiABHpfhrvj9K3tAdLN+5izbZ9gFFUCOvfqWXVVv1MaM/GH13F0SOHLJ3bLGq/CQA3AJPdvRbAzH4AvAy0GiDAeUCVu68N15kDzABWpbT5PHCfu+8CaC88RKR7Gj2g4thpuS2dQdW4R/PDJ95gwVvJXfQox4saIMbxhykbwtfaMgjYlPK8GpjcrM3pAGa2gOAw17fd/YkT3txsFjALoLi4OGLJItJdVJaXUFlewsOfn3LstN09B46wY38d4wb25gPjTmVX7WGeXLmVVVt2s21fHSMry487C+ut7bVs2Hkw2x8lr0QNkP8DLDKzx8PnHwF+1c46LQVM8+NlRcAY4GJgMDDfzMa7++7jVnKfDcyG4BBWxJpFpBtK3VtJVVle0u7FhY1T3QOc
eWoFSzfuYuXmvRw80sDAPifRt7T4uLOw1mzfz4otyV+Zn6siBYi7/9jMngcuJAiGz7r7snZWqwZSz2kbDGxpoc1Cdz8CrDOz1QSBsjhKXSIi6dQ8fKJccNg4Jf6Cqh3sO1QPEOksrEcWb+RIZ28XmWVtDqKbWS9332tm/Vpa7u6tHmw0syKCQfRLgc0EofBJd1+Z0mYawcD6Z8ysP7AMmODuNa1tV4PoItJV/GHpJuYuqaZvaREj+pdR3wAbd9VyctlJbN17iHf2H4p8FlZN7WEWrG19/Ccbg+i/Az5EMAdWatJY+Hxkayu6e72Z3QQ8STC+8aC7rzSzO4Al7j4vXPaB8ELFBuCrbYWHiEhX8pFzhqT14sPGkwmO1DewaN1O/uupN08YN0gnTecuItJFnfH1/0td+CM+iT2QgiiNzOy9ZlYWPr7WzH5sZkPTWYiIiKRZpJ/wyW/+fuBAeDX614ANwG8Tq0pERDrNEx6kjxog9R4c65oB3O3udwOabF9EJIclPUIR9TqQfWb2deBa4KJwmpIeyZUlIiK5LuoeyCcIpnG/wd23ElxlfmdiVYmISKclfZlJ1AsJtwI/Tnm+EXgoqaJERKTz0nrKVQvaDBAze8ndLzSzfbRwHYi790q0OhERiS3pizTaDBB3vzD8WwPmIiJ5Juk9kKjXgZxvZhUpz8vNrPnMuiIikkOSvtNjR64D2Z/y/ED4moiI5KikD2FFDRDzlDlP3P0o0U8BFhGRLMiJQ1jAWjP7spn1CP/cDKxNsjAREemcXNkD+QJwAcG07I13FpyVVFEiIpL7ol4Hsh2YmXAtIiKSR6KehXW6mT1rZivC52eZ2TeTLU1ERDqjR8KDIFEPYf0C+DpwBMDdX0N7JCIiOe2SM09JdPtRA6TU3V9p9lp9uosREZH0+Y+rzkp0+1ED5B0zG0U4qG9mVwNvJ1aViIh0WmV5CT+dOSGx7Ue6pa2ZjQRmE5yJtQtYB3zK3TckVlkrdEtbEZGOqdlfR/+Knofcj/ZM53bbPQvLzAqASe5+WXhb2wJ335fOIkREJDmV5SUkcX/Cdg9hhVed3xQ+rlV4iIgIRB8DedrM/s3MhphZv8Y/iVYmIiI5LeoYyDpauCre3UcmUVRbNAYiItJxZnbA3cvSuc2oEyKOBW4ELiQIkvnAA+ksRERE8kvUPZBHgb3Aw+FL1wB93P3jCdbWIu2BiIh0XDb3QM5w97NTnj9nZq+msxAREckvUQfRl5nZ+Y1PwrsRLkimJBERyQdRD2G9AZwBbAxfGgq8ARwF3N2TvV4+hQ5hiYh0XDYPYU1L55uKiEj+i3o/kIxPWSIiIrkt6hiIiIjIcRINEDObZmarzazKzG5ro93VZuZmNinJekREJH0SCxAzKwTuA64guBDxGjMb20K7CuDLwKKkahERkfRLcg/kPKDK3de6+2FgDjCjhXbfAX4IHEqwFhERSbMkA2QQsCnleXX42jFmNhEY4u5/amtDZjbLzJaY2ZL6et0IUUQkF0Q9jTeOlm7nfuyik/A+I3cB17e3IXefTXBDK8rKytq/cEVERBKX5B5INTAk5flgYEvK8wpgPPC8ma0HzgfmaSBdRCQ/JBkgi4ExZjbCzIqBmcC8xoXuvsfd+7v7cHcfDiwEprv7kgRrEhGRNEksQNy9nuBOhk8STHvyqLuvNLM7zGx6Uu8rIiKZEWkurFyiubBERDouibmwdCW6iIjEogAREZFYFCAiIhKLAkRERGJRgIiISCwKEBERiUUBIiIisShAREQkFgWIiIjEogAREZFYFCAiIhKLAkRERGJRgIiISCwKEBERiUUBIiIisShAREQkFgWIiIjEogAREZFYFCAiIhKLAkRERGJRgIiISCwKEBERiUUBIiIisShAREQkFgWIiIjEogAREZFYFCAiIhKLAkRERGJRgIiISCwKEBERiSXRADGzaWa22syqzOy2FpbfamarzOw1M3vWzIYlWY+IiKRPYgFiZoXAfcAV
wFjgGjMb26zZMmCSu58F/B74YVL1iIhIeiW5B3IeUOXua939MDAHmJHawN2fc/cD4dOFwOAE6xERkTRKMkAGAZtSnleHr7XmBuAvLS0ws1lmtsTMltTX16exRBERiasowW1bC695iw3NrgUmAVNbWu7us4HZAGVlZS1uQ0REMivJAKkGhqQ8Hwxsad7IzC4DbgemuntdgvWIiEgaJXkIazEwxsxGmFkxMBOYl9rAzCYCPwemu/v2BGsREZE0SyxA3L0euAl4EngDeNTdV5rZHWY2PWx2J1AOPGZmy81sXiubExGRHGPu+TWkUFZW5rW1tdkuQ0Qkr5jZAXcvS+c2dSW6iIjEogAREZFYFCAiIhKLAkRERGJRgIiISCwKEBERiUUBIiIisShAREQkFgWIiIjEogAREZFYFCAiIhKLAkRERGJRgIiISCwKEBERiUUBIiIisShAREQkFgWIiIjEogAREZFYFCAiIhKLAkRERGJRgIiISCwKEBERiUUBIiIisShAREQkFgWIiIjEogAREZFYFCAiIhKLAkRERGJRgIiISCwKEBERiUUBIiIisSQaIGY2zcxWm1mVmd3WwvISM5sbLl9kZsOTrEdERNInsQAxs0LgPuAKYCxwjZmNbdbsBmCXu48G7gJ+kFQ9IiKSXknugZwHVLn7Wnc/DMwBZjRrMwP4Tfj498ClZmYJ1iQiImlSlOC2BwGbUp5XA5Nba+Pu9Wa2B6gE3kltZGazgFkpzw8kUXAeKgLqs11EjlBfNFFfNFFfNOmZ7g0mGSAt7Ul4jDa4+2xgNoCZLXH3SZ0vL/+pL5qoL5qoL5qoL5qY2ZJ0bzPJQ1jVwJCU54OBLa21MbMioDewM8GaREQkTZIMkMXAGDMbYWbFwExgXrM284DPhI+vBv7q7ifsgYiISO5J7BBWOKZxE/AkUAg86O4rzewOYIm7zwN+BfzWzKoI9jxmRtj07KRqzkPqiybqiybqiybqiyZp7wvTL/wiIhKHrkQXEZFYFCAiIhJLzgaIpkFpEqEvbjWzVWb2mpk9a2bDslFnJrTXFyntrjYzN7MuewpnlL4ws4+H342VZva7TNeYKRH+jww1s+fMbFn4/+TKbNSZNDN70My2m9mKVpabmf007BAaZ1IAAAQQSURBVKfXzOycTr2hu+fcH4JB97eAkUAx8CowtlmbG4EHwsczgbnZrjuLffF+oDR8/MXu3BdhuwrgRWAhMCnbdWfxezEGWAb0DZ+fku26s9gXs4Evho/HAuuzXXdCfXERcA6wopXlVwJ/IbgG73xgUWfeL1f3QDQNSpN2+8Ldn3P3xqvzFxJcc9MVRfleAHwH+CFwKJPFZViUvvg8cJ+77wJw9+0ZrjFTovSFA73Cx7058Zq0LsHdX6Tta+lmAA95YCHQx8xOi/t+uRogLU2DMqi1Nu5eDzROg9LVROmLVDcQ/IbRFbXbF2Y2ERji7n/KZGFZEOV7cTpwupktMLOFZjYtY9VlVpS++DZwrZlVA38GvpSZ0nJOR3+etCnJqUw6I23ToHQBkT+nmV0LTAKmJlpR9rTZF2ZWQDCr8/WZKiiLonwviggOY11MsFc638zGu/vuhGvLtCh9cQ3wa3f/kZlNIbj+bLy7H02+vJyS1p+buboHomlQmkTpC8zsMuB2YLq712Wotkxrry8qgPHA82a2nuAY77wuOpAe9f/IH939iLuvA1YTBEpXE6UvbgAeBXD3l4GTgP4ZqS63RPp5ElWuBoimQWnSbl+Eh21+ThAeXfU4N7TTF+6+x937u/twdx9OMB403d3TPolcDojyf+QPBCdYYGb9CQ5prc1olZkRpS82ApcCmNm7CAJkR0arzA3zgOvCs7HOB/a4+9txN5aTh7A8uWlQ8k7EvrgTKAceC88j2Oju07NWdEIi9kW3ELEvngQ+YGargAbgq+5ek72qkxGxL/4V+IWZfYXgkM31XfEXTjN7hOCQZf9wvOffgR4A7v4AwfjPlUAVcAD4
bKferwv2oYiIZECuHsISEZEcpwAREZFYFCAiIhKLAkRERGJRgIiISCwKEJEMMrPhjTOlmtnFZtbVp1yRLkwBIhJBeOGV/r+IpNB/CJFWhHsLb5jZz4ClwKfN7GUzW2pmj5lZedjuPWb2/8zsVTN7xcwqwnXnh22XmtkF2f00IumnABFp2xnAQ8DlBPMpXebu5wBLgFvDqTPmAje7+9nAZcBBYDtwedj2E8BPs1G8SJJycioTkRyywd0XmtmHCG5EtCCcLqYYeJkgYN5298UA7r4XwMzKgHvNbALBNCKnZ6N4kSQpQETaVhv+bcDT7n5N6kIzO4uWp8P+CrANOJtgT78r39xKuikdwhKJZiHwXjMbDWBmpWZ2OvB3YKCZvSd8vSLl9gJvh/eb+DTBJH8iXYoCRCQCd99BcKOqR8zsNYJAOTO8heongHvM7FXgaYKpwn8GfMbMFhIcvqptccMieUyz8YqISCzaAxERkVgUICIiEosCREREYlGAiIhILAoQERGJRQEiIiKxKEBERCSW/w9v6VhC2DTcyQAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 2 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "plot_precision_recall_curve(prc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Value encoded by precision–recall curves\n",
    "\n",
    "With precision–recall curves, we get a generalized perspective on F1 scores (and we could weight precision and recall differently to achieve the effects of `beta` for F scores more generally). These curves can be used not only to assess a system but also to identify an optimal decision boundary given external goals."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Weaknesses of precision–recall curves\n",
    "\n",
    "* Most implementations are limited to binary problems. The basic concepts are defined for multi-class problems, but it's very difficult to understand the resulting hyperplanes.\n",
    "\n",
    "* There is no single statistic that does justice to the full curve, so this metric isn't useful on its own for guiding development and optimization. Indeed, opening up the decision threshold in this way really creates another hyperparameter that one has to worry about!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Related to precision–recall curves\n",
    "\n",
    "* The [Receiver Operating Characteristic (ROC) curve](#Receiver-Operating-Characteristic-(ROC)-curve) is superficially similar to the precision–recall curve, but it compares recall with the false positive rate.\n",
    "\n",
    "* [Average precision](#Average-precision), covered next, is a way of summarizing these curves with a single number."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Average precision\n",
    "\n",
    "Average precision is a method for summarizing the precision–recall curve. It does this by calculating the average precision weighted by the change in recall from step to step along the curve.\n",
    "\n",
    "Here is the calculation in terms of the data structures returned by `precision_recall_curve` above, in which (as in sklearn) the largest recall value is first:\n",
    "\n",
    "$$\\textbf{average-precision}(p, r) = \\sum_{i=1}^{n} (r_{i} - r_{i+1})p_{i}$$\n",
    "\n",
    "where $n$ is the number of thresholds, taken in increasing order, and the precision and recall vectors $p$ and $r$ are of length $n+1$. (We insert a final pair of values $p=1$ and $r=0$ in the precision–recall curve calculation, with no threshold for that point.)"
   ]
  },
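  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [],
   "source": [
    "def average_precision(p, r):\n",
    "    \"\"\"`p` and `r` are the precision and recall vectors from\n",
    "    `precision_recall_curve`, ordered so that recall is decreasing.\n",
    "    \"\"\"\n",
    "    total = 0.0\n",
    "    for i in range(len(p)-1):\n",
    "        total += (r[i] - r[i+1]) * p[i]\n",
    "    return total"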
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZAAAAEjCAYAAAAc4VcXAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAgAElEQVR4nO3de5xVdb3/8ddnZpiRuXAbFOV+VQNSUBIxE/NS6CmooxWWmeWJUx5L85z6Wfbo9LM6p/KUmZpG5a/sYYJ2jsWjX3nNC/IThAAVMGTkOiAXh/sAAzN8fn+sNcxmmMuaNXvty8z7+XjwYO+9vmvtz/6ymfes9V3ru8zdERER6aiCbBcgIiL5SQEiIiKxKEBERCQWBYiIiMSiABERkVgUICIiEosCRLoVM+tjZjeGjy82sz8l8B7Xm9m9HVxnvZn1b+H1b5vZv6WvOpH0UYBId9MHuLEjK5hZYUK1iOQ1BYh0N98HRpnZcuBOoNzMfm9mfzezh83M4NgewbfM7CXgY2Y2ysyeMLO/mdl8MzszbPcxM1thZq+a2Ysp7zMwbL/GzH7Y+KKZXWNmr4fr/KClAs3sdjNbbWbPAGck1REinVWU7QJEMuw2YLy7TzCzi4E/AuOALcAC4L3AS2HbQ+5+IYCZPQt8wd3XmNlk4GfAJcC3gA+6+2Yz65PyPhOAiUAdsNrM7gEagB8A5wK7gKfM7CPu/ofGlczsXGBmuG4RsBT4W/q7QaTzFCDS3b3i7tUA4V7JcJoCZG74ejlwAfBYuIMCUBL+vQD4tZk9CvxPynafdfc94fqrgGFAJfC8u+8IX38YuAj4Q8p67wMed/cDYZt5afukImmmAJHuri7lcQPH/5+oDf8uAHa7+4TmK7v7F8I9kn8AlptZY5uWtmvN12+FJqiTvKAxEOlu9gEVHVnB3fcC68zsYwAWODt8PMrdF7n7t4B3gCFtbGoRMNXM+ocD89cALzRr8yLwUTPraWYVwIc7UqtIJmkPRLoVd68xswVmtgI4CGyLuOqngPvN7JtAD2AO8Cpwp5mNIdi7eDZ87YQ9lfC93zazrwPPhe3/7O5/bNZmqZnNBZYDG4D5Hf2MIplims5dRETi0CEsERGJRQEiIiKxKEBERCSWrAeImU0Lr7qtMrPbWlheYmZzw+WLzGx4yrKvh6+vNrMPprz+oJltDwdK80LcfjCz4WZ20MyWh38eSFmn2Mxmm9mb4ZXWV2XuE8UXoS8uMrOlZlZvZlc3W/aEme1uPseVmd0Ubs9bmnMqV3WmL8Llvcxsc+rcXGb2PTPbZGb7k64/neL2hZm9P+X/x3IzO2RmH2m27j351B8R+uJWM1tlZq+Z2bNmNixl2WcsmCFhjZl9poV150X+2enuWfsDFAJvASOBYoIzWMY2a3Mj8ED4eCYwN3w8NmxfAowIt1MYLrsIOAdYkc3Pl6F+GN7a5wT+N/Dd8HEB0D/bnzVNfTEcOAt4CLi62bJLCU59/VOz1yeG663Ph35IR1+Ey+8Gfgfcm/La+cBpwP5sf8ZM9kXYph+wEyhNeW0S8Nt86Y+IffH+xs8IfDHl50U/YG34d9/wcd+U9f4x/L5E+tmZ7T2Q84Aqd1/r7ocJTo2c0azNDOA34ePfA5dacDnwDGCOu9e5+zqgKtwe7v4iwZckX3SmH9ryOeA/Adz9qLu/k8aak9JuX7j7end/DTjafGV3f5bgWo/mry9z9/XJlJyYTvVFOC3KAOCpZussdPe3kys7EZ3qixRXA3/xpiv9CwnmRPtaMmUnIkpfPNf4GYGFwODw8QeBp919p7vvAp4GpsGxGRduBb4btZBsB8ggYFPK8+rwtRbbuHs9sIdgSogo6+aLzvQDwAgzW2ZmL5jZ+yCYtjxc9p1wt/4xMxuQ2CdIn67079pZsfvCzAqAHwFfTaCubEjX92Im8EjK85uAeXkWqB3tixuAv0RY9zsE35kD
RJTtAGnpN+jmF6a01ibKuvmiM/3wNjDU3ScS/PbwOzPrRXCR6GBggbufA7wM/Ff6Sk5MV/p37azO9MWNBBcqbmq3ZX7o9PfCzE4D3g08GT4fCHwMuKfT1WVW5L4ws2sJDtHd2da6FkzBM9rdH+9IIdkOkGqOn/phMMGsqC22MbMioDfB4ako6+aL2P0QHsKrAXD3vxEcGz0dqCH4TaLxC/EYwbhQrutK/66d1Zm+mALcZGbrCX5xuM7Mvp/e8jIqHd+LjxNMVHkkfD4RGA1Uhf1UamZVnS00AyL1hZldBtwOTHf3unbWnQKcG/bDS8DpZvZ8u5VkeTCoiGAQZwRNg0HjmrX5F44fPH40fDyO4wfR1xIOonvTgFq+DKJ3ph9OpunkgZHAZqBf+HwOcEn4+HrgsWx/1nT0RUrbX9PywPHFNBtET1m2nvwZRO90X6T829/bwut5MWicxu/FQuD9bbxHXvRHxJ8XEwl+mRzT7PV+wDqCAfS+4eN+zdpE/tmZC51xJfBm+GFvD1+7gyA1AU4i+O25CngFGJmy7u3hequBK1Jef4Tg0M4RgsS9IdufM6l+AK4CVoZfoqXAh1O2OYxgcr7XCOZpGprtz5mmvnhP+O9aS7CntTJl3fnADoJ5rqoJ7tUB8OXweT3Bb1y/zPbnTLovUrZxPcefhfXDcJ2j4d/fzvbnzMD3YjjBL1cFbWw/LwIkYl88QzDP2/Lwz7yUdT8X/hypAj7bwraHEzFANBeWiIjEku0xEBERyVMKEBERiUUBIiIisShAREQklrwLEDOble0acoX6oon6oon6oon6okkSfZF3AQLoC9FEfdFEfdFEfdFEfdFEASIiIrkh764DMTMvLS3Ndhk5ob6+nqKiomyXkRPUF03UF03UF00OHDjg7p7WnYa869nS0lJqa2uzXYaISF4xs4Pp3qYOYYmISCwKEBERiUUBIiIisShAREQkFgWIiIjEogAREZFYFCAiIhKLAkRERGJRgIiISCwKEBERiUUBIiIisShAREQkFgWIiIjEogAREZFYFCAiIhJLYvcDMbMHgQ8B2919fAvLDbgbuBI4AFzv7kvb227DUee+v75JcVEhh+sbTvi7X1kJADtr61ptk/Tf2awhXz5/v7ISBvUtZdzAXlSWl6TjKyciGZbkDaV+DdwLPNTK8iuAMeGfycD94d9tOlx/lDufWpOmEiXbCg3u+sQEpk8YlO1SRKSDEjuE5e4vAjvbaDIDeMgDC4E+ZnZaUvVIbmpwuGXucmr212W7FBHpoGyOgQwCNqU8rw5fO4GZzTKzJWa2xI82ZKQ4yZyjDi+/VZPtMkSkg7IZINbCa95SQ3ef7e6T3H2SFRQmXJZkw4aa/dkuQUQ6KJsBUg0MSXk+GNiSpVokyzbtOpjtEkSkg5IcRG/PPOAmM5tDMHi+x93fbm+l4qICvvqBMXl/FlJXfO8oNew9WM/P56874d91zuJq3j2oD586f1h6vl0ikjhzb/GoUec3bPYIcDHQH9gG/DvQA8DdHwhP470XmEZwGu9n3X1Je9stKyvz2traRGqW5L26aTdX37+AI0dPXFZo8Mrtl+m0XpEEmNkBdy9L5zYT2wNx92vaWe7AvyT1/pKbBvftiVkBcGKCNDg8vXIrMydrL0QkH+hKdMmoyvIS/n362FaX3/b4Cr71x9czWJGIxKUAkYz71ORhzDi79Ut+Hnp5I1Xb9mWwIhGJQwEiWTFxaN82lz+5cmuGKhGRuBQgkhUXju7f5vI7n3qThxdtyFA1IhKHAkSyYvSACq6bMrTNNrc/voKHFypERHKVAkSy5o4Z7+Y/P3rCRM3H+eYfVmieLJEcpQCRrPrAuFMpbGlSm5AD//TrxQoRkRykAJGsqiwv4a5PTKCgjRBZVr2HSd99hnnLN2euMBFplwJEsm76hEE8dctFbbZx4JY5mvZdJJcoQCQnjB5QwffaGQ85Ctwyd1lmChKRdilAJGd8avIwvnHlmW22mb+m
RhcZiuSIxCZTTIomU+z6Hl64gdv/sKLV5ReOqmTmeUPo1bNY91QXiSiJyRQVIJKTqrbt47K7Xmy3ne6pLhJNXs3GK9IZowdU8MWpI7n/hbVttmtwuHnOcgb2PolJIyqp2V/Hyi172HvwCID2UkQSpD0QyVk1++s497vPRG4/rO9JbNx16IT7Ihvwj+cM4otTRzF6QEVaaxTJF0nsgWgQXXJWZXkJ118Q/d4gG1oIDwhOAf7vpZu57K4XNVW8SBopQCSnfemSMWndnqaKF0kfBYjktMryEn46c0Kb05101EtVO9K3MZFuTIPokvOmTxjEe0f3PzY4/vjyzTz7RvwQeGPz3jRWJ9J9aRBd8lLVtn385Jk3+dPr8W48deaAMh7+/BSdnSXdhq4DQQEix2s8bXfzroPsrK2jX1kJg/qWMm5gL37+wlvMnr+uzfV/OlPXkEj3oABBASLRvfjmDq578JU22xQavHL7ZdoTkS5Pp/GKdMC4gb3aHXxvcFi5ZU9mChLpYhQg0mU13mukqJ1v+ZtbdVqvSBw6hCVdXs3+On45f22r06IYcOvlp/PJyUN1KEu6LI2BoACR+H7y9Gp+8mxVq8sLDK4cfyrXXzCcA0caANM8WtJlaDJFkU440nC0zeVHHf70+tbjTg024OtXnMmsqaMSrk4k/2gMRLqN0uKO/77kwH/85e9c98uFup2uSDMKEOk2Pjju1NjrvlhVw7nffYaHF21IY0Ui+U0BIt3G6AEVXDdlaKe2cfvjK3h4oUJEBDSILt1Q1bZ93P98Ff+9bEus9QuAxd/UxYeSX/LuLCwzmwbcDRQCv3T37zdbPhT4DdAnbHObu/+5rW0qQCRdUqdB+evft/P0G9sjr/vQ597DRaefkmB1IumVVwFiZoXAm8DlQDWwGLjG3VeltJkNLHP3+81sLPBndx/e1nYVIJKU1Nvh3vPXNaze1vr3bMrIvtz7yXO1FyJ5I9+mMjkPqHL3te5+GJgDzGjWxoFe4ePeQLxjCiJpUFlewkWnn8KHzh7Ek1+5mJ98/KxW2768dhfnfvcZ5i3fnMEKRXJLkgEyCNiU8rw6fC3Vt4Frzawa+DPwpZY2ZGazzGyJmS2pr69PolaRE3zknCGMH9j2PdRvnrNcp/dKt5VkgLQ0jV3z42XXAL9298HAlcBvzeyEmtx9trtPcvdJRUW69lEy5wtTR7e53IFb5i7LTDEiOSbJAKkGhqQ8H8yJh6huAB4FcPeXgZOA/gnWJNIhU0ZVtvibUKr5a2p0n3XplpIMkMXAGDMbYWbFwExgXrM2G4FLAczsXQQBohtWS86oLC/h7pkT2g2Rf330VZasq8lITSK5IunTeK8EfkJwiu6D7v49M7sDWOLu88Izr34BlBMcDfiauz/V1jZ1FpZkQ83+Oh5euJ4fP9P6ZIwAp/Uq5trzhzN5RD9NyCg5Ja9O402KAkSy6cLvP0P17o4Nmhca3PUJ3TpXsivfTuMV6XJuu2Jsh9dpcJ2tJV2TAkSkA6IMqrfEgYvv/CuPLNqgIJEuQ4ewRDpo3vLN3Dxn+QnnpHfE1DGVjB/Ui/oG2Lr3ENPPHsilY+PPFizSHo2BoACR3FCzv46X36rhgReqWLElPafw9i4pZOZ5Q6k9XM+4gb35wLhTNfguaaMAQQEiuadq2z6eXLmVV9bt5IU176R127dcOppbLj8jrduU7kkBggJEclvjhIz3PVfFonW70rLNySP6MvefL0jLtqT70llYIjmucULGuf98Ad+44sy0bHPRul26SFFykgJEJCGzpo7ib9+8jOunDOv0tuYs3piGikTSS4ewRDKgcdD9jbf3cPBwAwCvrK9hxZb9kdYf2KuEC0ZXcu6wfhpcl1g0BoICRLqWqm37eKlqB3sOHOGVdTtZsHZnpPUuOeNkLjr9ZC4c3Z/RA9qecl4EFCCAAkS6tut/tZDn13RsvGN4v55MHNqHwgJjcN9SjjQcpbS4iA+OO1XhIsckESC6uYZI
DikuKuzwOut3HmT9zoMnvH7nU28yfmAFl79rAPvqGpg2bgCTRlSmo0wRQHsgIjnlzife4L7n1ya2/X49C/nYpKE0OAqUbkaHsFCASNdWtW0fl931Ysber39ZD6affRrnDKtkyqhKDc53YQoQFCDS9X3rj6/z0MtNp+2OH1jBoSMNVO04kPh7nzmgnC9MHclHzhnSfmPJKwoQFCDSPVRt28fyTbuZMKTPsYHwqm37+N2iDTyxcitb9iQ7o29pD+Ohz03WIa4uRAGCAkQEmubfqt51gNVb97F00x4KgYY0v8/Y08r5881T07xVyQYFCAoQkZbU7K+jetdBBvftybod+/nJM2t46a0aCoCjndx2aRHMmDiYgoICPjphoPZK8pQCBAWISFSNoVJWXMjzq7fz/OrtrKs5wObdhzq13f5lPRgzoJx3D+rDhWNOZmDvk6g93MDgvj01CJ/DFCAoQEQ6q/Hw1/w177BwXbQr36MoMJg4pDc3XjyaCUP7HtsjUqjkBgUIChCRdKrZX8fTK7fyqwXrWLM9vf+vCgAz439NO4NZU0elddvScQoQFCAiSanato/rHlyUyBlep1YU8+kpw/nguFPpW1asvZMsUICgABFJ2ofveZHXN6fnNr2tMcAM/unC4XzjH8Yl+l4SUICgABHJhCXrapizeCN7D9az71A9L6dxrKS5vqWFXD9lBJt2HWRArxI+OnGwJoFMQNYCxMxKgKuA4aRMwOjud6SzmCgUICKZV7O/jseXVjPv1c28lvDeCcDo/qWcP6qSAoy17+ynqKCAkSeXclqfUg7XNzCsslxTr3RQNgPkCWAP8DdSrlVy9x+ls5goFCAi2dV43/fNuw6ys7aO7Xvr+MvKrWzfdzjjtUwYXEG/8hL6l5aw7/ARhvQtY2CfnrpPSguyGSAr3H18Ot84LgWISG5qvDlWSVEhT6/axl9X78hqPZec0Z8HPzv5WF1v7z7E1r2HmH72QC4de2pWa8uGbAbIbOAed389nW8ehwJEJD/U7K/jFy+8xUML13PgSHbGWlu7Er+kEGacPYijwIHDRzhUf5SK4iI27znEmFPKueqcwWzdW8eGmv1d5nBZNgNkFTAaWAfUEZxE4e5+VjqLiUIBIpJ/lqyr4clVWynA+Pn8ddkuJ5ZJQ3tz1uA+HDzSwP5D9RQVGrV19eyuPUID8Onzh+b0LMbZDJBhLb3u7hvSWUwUChCR/DZv+WZunbucBgeHRCaBzJYeBu8dXUnFSUWMOqWCbXsPsWnnQU6uKAZgy+6D9O7Zg1N79aR690F69yxi9CkVx24/nDqnWbr3eLJ6Gq+ZnQ28L3w6391fjbDONOBugu/IL939+y20+TjwbYLv0qvu/sm2tqkAEcl/qfN0Nc6jtav2MA++tJY3t+9jZGU5xT0KKMB4ctVWtmVhgD7TBvcuYfOeOsyC5+NOq6BHodG/vISxA3vTo9DY+M4BVm7byyllJRysb6BnYSHbD9Qxsl8ZvUp7UICxo/bQsZMJxg/sRdX2/azYsofvXTXxsB9tSGsqRd0DuRn4PPA/4UsfBWa7+z1trFMIvAlcDlQDi4Fr3H1VSpsxwKPAJe6+y8xOcfftbdWiABHpfhrvj9K3tAdLN+5izbZ9gFFUCOvfqWXVVv1MaM/GH13F0SOHLJ3bLGq/CQA3AJPdvRbAzH4AvAy0GiDAeUCVu68N15kDzABWpbT5PHCfu+8CaC88RKR7Gj2g4thpuS2dQdW4R/PDJ95gwVvJXfQox4saIMbxhykbwtfaMgjYlPK8GpjcrM3pAGa2gOAw17fd/YkT3txsFjALoLi4OGLJItJdVJaXUFlewsOfn3LstN09B46wY38d4wb25gPjTmVX7WGeXLmVVVt2s21fHSMry487C+ut7bVs2Hkw2x8lr0QNkP8DLDKzx8PnHwF+1c46LQVM8+NlRcAY4GJgMDDfzMa7++7jVnKfDcyG4BBWxJpFpBtK3VtJVVle0u7FhY1T3QOc
eWoFSzfuYuXmvRw80sDAPifRt7T4uLOw1mzfz4otyV+Zn6siBYi7/9jMngcuJAiGz7r7snZWqwZSz2kbDGxpoc1Cdz8CrDOz1QSBsjhKXSIi6dQ8fKJccNg4Jf6Cqh3sO1QPEOksrEcWb+RIZ28XmWVtDqKbWS9332tm/Vpa7u6tHmw0syKCQfRLgc0EofBJd1+Z0mYawcD6Z8ysP7AMmODuNa1tV4PoItJV/GHpJuYuqaZvaREj+pdR3wAbd9VyctlJbN17iHf2H4p8FlZN7WEWrG19/Ccbg+i/Az5EMAdWatJY+Hxkayu6e72Z3QQ8STC+8aC7rzSzO4Al7j4vXPaB8ELFBuCrbYWHiEhX8pFzhqT14sPGkwmO1DewaN1O/uupN08YN0gnTecuItJFnfH1/0td+CM+iT2QgiiNzOy9ZlYWPr7WzH5sZkPTWYiIiKRZpJ/wyW/+fuBAeDX614ANwG8Tq0pERDrNEx6kjxog9R4c65oB3O3udwOabF9EJIclPUIR9TqQfWb2deBa4KJwmpIeyZUlIiK5LuoeyCcIpnG/wd23ElxlfmdiVYmISKclfZlJ1AsJtwI/Tnm+EXgoqaJERKTz0nrKVQvaDBAze8ndLzSzfbRwHYi790q0OhERiS3pizTaDBB3vzD8WwPmIiJ5Juk9kKjXgZxvZhUpz8vNrPnMuiIikkOSvtNjR64D2Z/y/ED4moiI5KikD2FFDRDzlDlP3P0o0U8BFhGRLMiJQ1jAWjP7spn1CP/cDKxNsjAREemcXNkD+QJwAcG07I13FpyVVFEiIpL7ol4Hsh2YmXAtIiKSR6KehXW6mT1rZivC52eZ2TeTLU1ERDqjR8KDIFEPYf0C+DpwBMDdX0N7JCIiOe2SM09JdPtRA6TU3V9p9lp9uosREZH0+Y+rzkp0+1ED5B0zG0U4qG9mVwNvJ1aViIh0WmV5CT+dOSGx7Ue6pa2ZjQRmE5yJtQtYB3zK3TckVlkrdEtbEZGOqdlfR/+Knofcj/ZM53bbPQvLzAqASe5+WXhb2wJ335fOIkREJDmV5SUkcX/Cdg9hhVed3xQ+rlV4iIgIRB8DedrM/s3MhphZv8Y/iVYmIiI5LeoYyDpauCre3UcmUVRbNAYiItJxZnbA3cvSuc2oEyKOBW4ELiQIkvnAA+ksRERE8kvUPZBHgb3Aw+FL1wB93P3jCdbWIu2BiIh0XDb3QM5w97NTnj9nZq+msxAREckvUQfRl5nZ+Y1PwrsRLkimJBERyQdRD2G9AZwBbAxfGgq8ARwF3N2TvV4+hQ5hiYh0XDYPYU1L55uKiEj+i3o/kIxPWSIiIrkt6hiIiIjIcRINEDObZmarzazKzG5ro93VZuZmNinJekREJH0SCxAzKwTuA64guBDxGjMb20K7CuDLwKKkahERkfRLcg/kPKDK3de6+2FgDjCjhXbfAX4IHEqwFhERSbMkA2QQsCnleXX42jFmNhEY4u5/amtDZjbLzJaY2ZL6et0IUUQkF0Q9jTeOlm7nfuyik/A+I3cB17e3IXefTXBDK8rKytq/cEVERBKX5B5INTAk5flgYEvK8wpgPPC8ma0HzgfmaSBdRCQ/JBkgi4ExZjbCzIqBmcC8xoXuvsfd+7v7cHcfDiwEprv7kgRrEhGRNEksQNy9nuBOhk8STHvyqLuvNLM7zGx6Uu8rIiKZEWkurFyiubBERDouibmwdCW6iIjEogAREZFYFCAiIhKLAkRERGJRgIiISCwKEBERiUUBIiIisShAREQkFgWIiIjEogAREZFYFCAiIhKLAkRERGJRgIiISCwKEBERiUUBIiIisShAREQkFgWIiIjEogAREZFYFCAiIhKLAkRERGJRgIiISCwKEBERiUUBIiIisShAREQkFgWIiIjEogAREZFYFCAiIhKLAkRERGJRgIiISCwKEBERiSXRADGzaWa22syqzOy2FpbfamarzOw1M3vWzIYlWY+IiKRPYgFiZoXAfcAV
wFjgGjMb26zZMmCSu58F/B74YVL1iIhIeiW5B3IeUOXua939MDAHmJHawN2fc/cD4dOFwOAE6xERkTRKMkAGAZtSnleHr7XmBuAvLS0ws1lmtsTMltTX16exRBERiasowW1bC695iw3NrgUmAVNbWu7us4HZAGVlZS1uQ0REMivJAKkGhqQ8Hwxsad7IzC4DbgemuntdgvWIiEgaJXkIazEwxsxGmFkxMBOYl9rAzCYCPwemu/v2BGsREZE0SyxA3L0euAl4EngDeNTdV5rZHWY2PWx2J1AOPGZmy81sXiubExGRHGPu+TWkUFZW5rW1tdkuQ0Qkr5jZAXcvS+c2dSW6iIjEogAREZFYFCAiIhKLAkRERGJRgIiISCwKEBERiUUBIiIisShAREQkFgWIiIjEogAREZFYFCAiIhKLAkRERGJRgIiISCwKEBERiUUBIiIisShAREQkFgWIiIjEogAREZFYFCAiIhKLAkRERGJRgIiISCwKEBERiUUBIiIisShAREQkFgWIiIjEogAREZFYFCAiIhKLAkRERGJRgIiISCwKEBERiUUBIiIisSQaIGY2zcxWm1mVmd3WwvISM5sbLl9kZsOTrEdERNInsQAxs0LgPuAKYCxwjZmNbdbsBmCXu48G7gJ+kFQ9IiKSXknugZwHVLn7Wnc/DMwBZjRrMwP4Tfj498ClZmYJ1iQiImlSlOC2BwGbUp5XA5Nba+Pu9Wa2B6gE3kltZGazgFkpzw8kUXAeKgLqs11EjlBfNFFfNFFfNOmZ7g0mGSAt7Ul4jDa4+2xgNoCZLXH3SZ0vL/+pL5qoL5qoL5qoL5qY2ZJ0bzPJQ1jVwJCU54OBLa21MbMioDewM8GaREQkTZIMkMXAGDMbYWbFwExgXrM284DPhI+vBv7q7ifsgYiISO5J7BBWOKZxE/AkUAg86O4rzewOYIm7zwN+BfzWzKoI9jxmRtj07KRqzkPqiybqiybqiybqiyZp7wvTL/wiIhKHrkQXEZFYFCAiIhJLzgaIpkFpEqEvbjWzVWb2mpk9a2bDslFnJrTXFyntrjYzN7MuewpnlL4ws4+H342VZva7TNeYKRH+jww1s+fMbFn4/+TKbNSZNDN70My2m9mKVpabmf007BAaZ1IAAAQQSURBVKfXzOycTr2hu+fcH4JB97eAkUAx8CowtlmbG4EHwsczgbnZrjuLffF+oDR8/MXu3BdhuwrgRWAhMCnbdWfxezEGWAb0DZ+fku26s9gXs4Evho/HAuuzXXdCfXERcA6wopXlVwJ/IbgG73xgUWfeL1f3QDQNSpN2+8Ldn3P3xqvzFxJcc9MVRfleAHwH+CFwKJPFZViUvvg8cJ+77wJw9+0ZrjFTovSFA73Cx7058Zq0LsHdX6Tta+lmAA95YCHQx8xOi/t+uRogLU2DMqi1Nu5eDzROg9LVROmLVDcQ/IbRFbXbF2Y2ERji7n/KZGFZEOV7cTpwupktMLOFZjYtY9VlVpS++DZwrZlVA38GvpSZ0nJOR3+etCnJqUw6I23ToHQBkT+nmV0LTAKmJlpR9rTZF2ZWQDCr8/WZKiiLonwviggOY11MsFc638zGu/vuhGvLtCh9cQ3wa3f/kZlNIbj+bLy7H02+vJyS1p+buboHomlQmkTpC8zsMuB2YLq712Wotkxrry8qgPHA82a2nuAY77wuOpAe9f/IH939iLuvA1YTBEpXE6UvbgAeBXD3l4GTgP4ZqS63RPp5ElWuBoimQWnSbl+Eh21+ThAeXfU4N7TTF+6+x937u/twdx9OMB403d3TPolcDojyf+QPBCdYYGb9CQ5prc1olZkRpS82ApcCmNm7CAJkR0arzA3zgOvCs7HOB/a4+9txN5aTh7A8uWlQ8k7EvrgTKAceC88j2Oju07NWdEIi9kW3ELEvngQ+YGargAbgq+5ek72qkxGxL/4V+IWZfYXgkM31XfEXTjN7hOCQZf9wvOffgR4A7v4AwfjPlUAVcAD4
bKferwv2oYiIZECuHsISEZEcpwAREZFYFCAiIhKLAkRERGJRgIiISCwKEJEMMrPhjTOlmtnFZtbVp1yRLkwBIhJBeOGV/r+IpNB/CJFWhHsLb5jZz4ClwKfN7GUzW2pmj5lZedjuPWb2/8zsVTN7xcwqwnXnh22XmtkF2f00IumnABFp2xnAQ8DlBPMpXebu5wBLgFvDqTPmAje7+9nAZcBBYDtwedj2E8BPs1G8SJJycioTkRyywd0XmtmHCG5EtCCcLqYYeJkgYN5298UA7r4XwMzKgHvNbALBNCKnZ6N4kSQpQETaVhv+bcDT7n5N6kIzO4uWp8P+CrANOJtgT78r39xKuikdwhKJZiHwXjMbDWBmpWZ2OvB3YKCZvSd8vSLl9gJvh/eb+DTBJH8iXYoCRCQCd99BcKOqR8zsNYJAOTO8heongHvM7FXgaYKpwn8GfMbMFhIcvqptccMieUyz8YqISCzaAxERkVgUICIiEosCREREYlGAiIhILAoQERGJRQEiIiKxKEBERCSW/w9v6VhC2DTcyQAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 2 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "plot_precision_recall_curve(prc)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.8043595615723373"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "average_precision(prc['precision'].values, prc['recall'].values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Bounds of average precision\n",
    "\n",
    "[0, 1], with 0 the worst and 1 the best."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Value encoded by average precision\n",
    "\n",
    "This measure is very similar to the F1 score, in that it is seeking to balance precision and recall. Whereas the F1 score does this with the harmonic mean, average precision does it by making precision a function of recall."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Weaknesses of average precision\n",
    "\n",
    "* An important weakness of this metric is cultural: it is often hard to tell whether a paper is reporting average precision or some interpolated variant thereof. The interpolated versions are meaningfully different and will tend to inflate scores. In any case, they are not comparable to the calculation defined above and implemented in `sklearn` as `sklearn.metrics.average_precision_score`.\n",
    "\n",
    "* Unlike for precision–recall curves, we aren't strictly speaking limited to binary classification here. Since we aren't trying to visualize anything, we can do these calculations for multi-class problems. However, then we have to decide on how the precision and recall values will be combined for each step: macro-averaged, weighted, or micro-averaged, just as with F$_{\\beta}$ scores. This introduces another meaningful design choice."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Related\n",
    "\n",
    "* There are interpolated versions of this score, and some tasks/communities have even settled on specific versions as their standard metrics. All such measures should be approached with skepticism, since all of them can inflate scores artificially in specific cases.\n",
    "\n",
    "* [This blog post](https://roamanalytics.com/2016/09/07/stepping-away-from-linear-interpolation/) is an excellent discussion of the issues with linear interpolation. It proposes a step-wise interpolation procedure that is much less problematic. I believe the blog post and subsequent PR to `sklearn` led the `sklearn` developers to drop support for all interpolation mechanisms for this metric!\n",
    "\n",
    "* Average precision as defined above is a discrete approximation of the [area under the precision–recall curve](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html#sklearn.metrics.auc). This is a separate measure often referred to as \"AUC\". In calculating AUC for a precision–recall curve, some kind of interpolation will be done, and this will generally produce exaggerated scores for the same reasons that interpolated average precision does."
   ]
  },
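  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of how interpolation can inflate scores, the sketch below compares a step-wise average precision with a trapezoidal (linearly interpolated) area for a hypothetical classifier that assigns every example the same score. The function names, the two-point curve, and the convention that points are sorted by increasing recall and start at recall 0 with precision 1 are assumptions made for this example; this is not the `average_precision` function used above.\n",
    "\n",
    "```python\n",
    "def step_average_precision(precision, recall):\n",
    "    # Weight each gain in recall by the precision reached at that point.\n",
    "    total = 0.0\n",
    "    for i in range(1, len(recall)):\n",
    "        total += (recall[i] - recall[i - 1]) * precision[i]\n",
    "    return total\n",
    "\n",
    "\n",
    "def trapezoid_area(precision, recall):\n",
    "    # Linear interpolation between neighboring points (trapezoidal rule).\n",
    "    total = 0.0\n",
    "    for i in range(1, len(recall)):\n",
    "        width = recall[i] - recall[i - 1]\n",
    "        total += width * (precision[i] + precision[i - 1]) / 2.0\n",
    "    return total\n",
    "\n",
    "\n",
    "# Hypothetical case: 2 positives among 10 examples, all with the same score,\n",
    "# so the only real operating point is recall 1.0 with precision 0.2.\n",
    "recall = [0.0, 1.0]\n",
    "precision = [1.0, 0.2]\n",
    "\n",
    "step_ap = step_average_precision(precision, recall)\n",
    "trapezoid_ap = trapezoid_area(precision, recall)\n",
    "print(step_ap, trapezoid_ap)\n",
    "```\n",
    "\n",
    "Under these assumptions, the step-wise value is just the proportion of positive examples, which seems right for an uninformative classifier, whereas the trapezoidal value is three times larger."
   ]
  },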
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Receiver Operating Characteristic (ROC) curve\n",
    "\n",
    "The Receiver Operating Characteristic (ROC) curve for a class $k$ depicts the __recall__ (true positive rate) for $k$ as a function of the __false positive rate__ (FPR) for $k$. For instance, suppose we focus on $k$ as the positive class $A$:\n",
    "\n",
    "$$\n",
    "\\begin{array}{r r r}\n",
    "\\hline\n",
    " & A & B \\\\\n",
    "\\hline\n",
    "A & \\text{TP}_{A} & \\text{FN}_{A}\\\\\n",
    "B & \\text{FP}_{A} & \\text{TN}_{A}\\\\\n",
    "\\hline\n",
    "\\end{array}\n",
    "$$\n",
    "\n",
    "The false positive rate is\n",
    "\n",
    "$$\n",
    "\\textbf{fpr}(A) = \\frac{\\text{FP}_{A}}{\\text{FP}_{A} + \\text{TN}_{A}}\n",
    "$$\n",
    "\n",
    "which is equivalent to 1 minus the recall for class $B$.\n",
    "\n",
    "ROC curves are implemented in [sklearn.metrics.roc_curve](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html).\n",
    "\n",
    "The area under the ROC curve is often used as a summary statistic: see [sklearn.metrics.roc_auc_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score).\n",
    "\n",
    "ROC is limited to binary problems."
   ]
  },
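  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the construction concrete, here is a minimal pure-Python sketch that sweeps a threshold over the predicted scores, records a (false positive rate, recall) point at each step, and then takes the area under those points with the trapezoidal rule. The names `roc_points` and `trapezoid_auc` and the four-example dataset are assumptions for illustration; for real work, the `sklearn` functions linked above are the better choice.\n",
    "\n",
    "```python\n",
    "def roc_points(y_true, y_score):\n",
    "    # Assumes binary labels (1 = positive) with both classes present.\n",
    "    pos = sum(y_true)\n",
    "    neg = len(y_true) - pos\n",
    "    points = [(0.0, 0.0)]\n",
    "    # Each distinct score serves as a threshold, from strictest to laxest.\n",
    "    for t in sorted(set(y_score), reverse=True):\n",
    "        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)\n",
    "        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)\n",
    "        points.append((fp / neg, tp / pos))\n",
    "    return points\n",
    "\n",
    "\n",
    "def trapezoid_auc(points):\n",
    "    area = 0.0\n",
    "    for (x0, y0), (x1, y1) in zip(points, points[1:]):\n",
    "        area += (x1 - x0) * (y0 + y1) / 2.0\n",
    "    return area\n",
    "\n",
    "\n",
    "y_true = [0, 0, 1, 1]\n",
    "y_score = [0.1, 0.4, 0.35, 0.8]\n",
    "\n",
    "points = roc_points(y_true, y_score)\n",
    "roc_auc = trapezoid_auc(points)\n",
    "print(points)\n",
    "print(roc_auc)\n",
    "```\n",
    "\n",
    "Each point pairs a false positive rate (x-axis) with the recall achieved at the same threshold (y-axis), so a curve that hugs the upper left corner is the ideal."
   ]
  },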
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Bounds of ROC\n",
    "\n",
    "* For individual ROC calculations of recall divided by fpr: [0, $\\infty$), with larger better.\n",
    "* For ROC AUC: [0, 1], with 1 the best."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Weaknesses of ROC\n",
    "\n",
    "Recall that, for two classes $A$ and $B$, \n",
    "\n",
    "$$\n",
    "\\begin{array}{r r r}\n",
    "\\hline\n",
    " & A & B \\\\\n",
    "\\hline\n",
    "A & \\text{TP}_{A} & \\text{FN}_{A}\\\\\n",
    "B & \\text{FP}_{A} & \\text{TN}_{A}\\\\\n",
    "\\hline\n",
    "\\end{array}\n",
    "$$\n",
    "\n",
    "we can express ROC as comparing $\\textbf{recall}(A)$ with $1.0 - \\textbf{recall}(B)$.\n",
    "\n",
    "This reveals a point of contrast with scores based on precision and recall: the entire table is used, whereas precision and recall for a class $k$ ignore the $\\text{TN}_{k}$ values. Thus, whereas precision and recall for a class $k$ will be insensitive to changes in $\\text{TN}_{k}$, ROC will be affected by such changes. The following individual ROC calculations (recall divided by fpr) help to bring this out:\n",
    "\n",
    "$$\n",
    "\\begin{array}{r r r r r}\n",
    "\\hline\n",
    " & A & B & \\textbf{F1} & \\textbf{ROC}\\\\\n",
    "\\hline\n",
    "A & 15 & 10 & 0.21 & 0.90 \\\\\n",
    "B & 100 & {\\color{blue}{50}} & 0.48 & 0.83 \\\\\n",
    "\\hline\n",
    "\\end{array}\n",
    "\\qquad\n",
    "\\begin{array}{r r r r r}\n",
    "\\hline\n",
    " & A & B & \\textbf{F1} & \\textbf{ROC} \\\\\n",
    "\\hline\n",
    "A & 15 & 10 & 0.21 & 3.6 \\\\\n",
    "B & 100 & {\\color{blue}{500}} & 0.90 & 2.08  \\\\\n",
    "\\hline\n",
    "\\end{array}\n",
    "$$\n",
    "\n",
    "One might worry that the model on the right isn't better at identifying class $A$, even though its ROC value for $A$ is larger."
   ]
  },
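  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The F1 and ROC columns in the two tables above can be reproduced with a few lines of arithmetic. This is only a sketch: the helper name `f1_and_roc_ratio` and the `rows` dictionary are made up for this example, and the individual ROC value is taken to be recall divided by the false positive rate.\n",
    "\n",
    "```python\n",
    "def f1_and_roc_ratio(tp, fn, fp, tn):\n",
    "    precision = tp / (tp + fp)\n",
    "    recall = tp / (tp + fn)\n",
    "    fpr = fp / (fp + tn)\n",
    "    f1 = 2 * precision * recall / (precision + recall)\n",
    "    return f1, recall / fpr\n",
    "\n",
    "\n",
    "rows = {}\n",
    "for b_correct in (50, 500):\n",
    "    # Class A as the positive class: TP=15, FN=10, FP=100, TN=b_correct.\n",
    "    rows['A', b_correct] = f1_and_roc_ratio(15, 10, 100, b_correct)\n",
    "    # Class B as the positive class: the roles of the cells are swapped.\n",
    "    rows['B', b_correct] = f1_and_roc_ratio(b_correct, 100, 10, 15)\n",
    "\n",
    "for key, (f1, roc) in rows.items():\n",
    "    print(key, round(f1, 2), round(roc, 2))\n",
    "```\n",
    "\n",
    "Only the blue cell changes between the two tables, and it is a true negative from the perspective of class $A$. The F1 score for $A$ therefore stays the same while its ROC value quadruples."
   ]
  },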
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Related to ROC\n",
    "\n",
    "ROC-based analysis is superficially similar to precision–recall curves and average precision, but we should have no expectation that the results will align, particularly in the presence of class imbalances like the one sketched above."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Regression metrics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Mean squared error\n",
    "\n",
    "The [mean squared error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error) is a summary of the distance between predicted and actual values:\n",
    "\n",
    "$$\n",
    "\\textbf{mse}(y, \\widehat{y}) = \\frac{1}{N}\\sum_{i=1}^{N} (y_{i} - \\hat{y_{i}})^{2}\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [],
   "source": [
    "def mean_squared_error(y_true, y_pred):\n",
    "    diffs = (y_true - y_pred)**2\n",
    "    return np.mean(diffs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The raw distances `y_true - y_pred` are often called the __residuals__."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Bounds of mean-squared error\n",
    "\n",
    "[0, $\\infty$), with 0 the best."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Value encoded by mean-squared error\n",
    "\n",
    "This measure seeks to summarize the errors made by a regression model. The smaller it is, the closer the model's predictions are to the truth. In this sense, it is intuitively like a counterpart to [accuracy](#Accuracy) for classifiers."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Weaknesses of mean-squared error\n",
    "\n",
    "These values are highly dependent on the scale of the output variables, making them very hard to interpret in isolation. One really needs a clear baseline, and scale-independent ways of comparing scores are also needed."
   ]
  },
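  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A small sketch of the scale dependence: the same hypothetical predictions are scored once in meters and once in centimeters. The data and the local helper `mse` are invented for this example.\n",
    "\n",
    "```python\n",
    "def mse(y_true, y_pred):\n",
    "    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)\n",
    "\n",
    "\n",
    "# Hypothetical gold values and predictions, in meters.\n",
    "gold_m = [1.0, 2.0, 3.0]\n",
    "pred_m = [1.1, 1.9, 3.2]\n",
    "\n",
    "# The same data expressed in centimeters.\n",
    "gold_cm = [100 * v for v in gold_m]\n",
    "pred_cm = [100 * v for v in pred_m]\n",
    "\n",
    "mse_m = mse(gold_m, pred_m)\n",
    "mse_cm = mse(gold_cm, pred_cm)\n",
    "print(mse_m, mse_cm, mse_cm / mse_m)\n",
    "```\n",
    "\n",
    "The model is equally good in both cases, but changing the unit by a factor of 100 changes the mean squared error by a factor of about 10,000, since the errors are squared."
   ]
  },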
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Related to mean-squared error\n",
    "\n",
    "Scikit-learn implements a variety of closely related measures: [mean absolute error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error), [mean squared logarithmic error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_log_error.html#sklearn.metrics.mean_squared_log_error), and [median absolute error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.median_absolute_error.html#sklearn.metrics.median_absolute_error). I'd say that one should choose among these metrics based on how the output values are scaled and distributed. For instance:\n",
    "\n",
    "* The median absolute error will be less sensitive to outliers than the others.\n",
    "* Mean squared logarithmic error might be more appropriate where the outputs are not strictly speaking linearly increasing. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### R-squared scores\n",
    "\n",
    "The R$^{2}$ score is probably the most prominent method for summarizing regression model performance, in statistics, social sciences, and ML/NLP. This is the value that `sklearn`'s regression models deliver with their `score` functions.\n",
    "\n",
    "$$\n",
    "\\textbf{r2}(y, \\widehat{y}) =\n",
    "    1.0 - \\frac{\n",
    "      \\sum_{i}^{N} (y_{i} - \\hat{y_{i}})^{2}     \n",
    "    }{\n",
    "       \\sum_{i}^{N} (y_{i} - \\mu)^{2}\n",
    "    }\n",
    "$$    \n",
    "where $\\mu$ is the mean of the gold values $y$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [],
   "source": [
    "def r2(y_true, y_pred):\n",
    "    mu = y_true.mean()\n",
    "    # Total sum of squares:\n",
    "    total = ((y_true - mu)**2).sum()\n",
    "    # Sum of squared errors:\n",
    "    res = ((y_true - y_pred)**2).sum()\n",
    "    return 1.0 - (res / total)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
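   },
   "source": [
    "#### Bounds of R-squared scores\n",
    "\n",
    "$(-\\infty, 1]$, with 1 the best. A model that always predicts the mean of the gold values scores 0, and a model whose squared error exceeds that of this mean baseline scores below 0. Scores are confined to [0, 1] in the special case of a least-squares linear regression with an intercept evaluated on its own training data."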
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Value encoded by R-squared scores\n",
    "\n",
    "The numerator in the R$^{2}$ calculation is the sum of the squared errors (the residual sum of squares):\n",
    "\n",
    "$$\n",
    "\\textbf{r2}(y, \\widehat{y}) =\n",
    "    1.0 - \\frac{\n",
    "      \\sum_{i}^{N} (y_{i} - \\hat{y_{i}})^{2}     \n",
    "    }{\n",
    "       \\sum_{i}^{N} (y_{i} - \\mu)^{2}\n",
    "    }\n",
    "$$ \n",
    "\n",
    "In the context of regular linear regression, the model's objective is to minimize this sum of squared errors. The denominator is the total sum of squares, which is the sum of squared errors of a baseline that always predicts the mean $\\mu$. Thus, R$^{2}$ is based on the ratio between the error the model achieved and the error of that baseline, which is a measure of the goodness of fit of the model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Weaknesses of R-squared scores\n",
    "\n",
    "For comparative purposes, it's nice that R$^{2}$ is normalized by the variation in the gold values, with 1 as the best possible score; as noted above, the lack of such normalization makes mean squared error hard to interpret. But this also represents a trade-off: R$^{2}$ doesn't tell us about the magnitude of the errors."
   ]
  },
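  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The trade-off can be seen in a toy example in which the same predictions are scored at two different scales. The data are hypothetical, and `r2` here is a plain-Python restatement of the function defined above, included so that the sketch is self-contained; `max_abs_error` is a helper invented for this example.\n",
    "\n",
    "```python\n",
    "def r2(y_true, y_pred):\n",
    "    mu = sum(y_true) / len(y_true)\n",
    "    total = sum((t - mu) ** 2 for t in y_true)\n",
    "    res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))\n",
    "    return 1.0 - res / total\n",
    "\n",
    "\n",
    "def max_abs_error(y_true, y_pred):\n",
    "    return max(abs(t - p) for t, p in zip(y_true, y_pred))\n",
    "\n",
    "\n",
    "gold = [1.0, 2.0, 3.0, 4.0]\n",
    "pred = [1.5, 2.0, 3.0, 4.0]\n",
    "\n",
    "# The same problem with every value multiplied by 1000.\n",
    "big_gold = [1000 * v for v in gold]\n",
    "big_pred = [1000 * v for v in pred]\n",
    "\n",
    "print(r2(gold, pred), max_abs_error(gold, pred))\n",
    "print(r2(big_gold, big_pred), max_abs_error(big_gold, big_pred))\n",
    "```\n",
    "\n",
    "The two R$^{2}$ values agree, even though the largest error is 0.5 in one case and 500 in the other. Whether an error of that size matters depends on the application, and R$^{2}$ alone cannot say."
   ]
  },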
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Related to R-squared scores\n",
    "\n",
    "* R$^{2}$ is [closely related to the squared Pearson correlation coefficient](https://en.wikipedia.org/wiki/Coefficient_of_determination#As_squared_correlation_coefficient).\n",
    "\n",
    "* R$^{2}$ is closely related to the [explained variance](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.explained_variance_score.html#sklearn.metrics.explained_variance_score), which is also defined in terms of a ratio of the residuals and the variation in the gold data. For explained variance, the numerator is the variance of the residuals and the denominator is the variance of the gold values.\n",
    "\n",
    "* [Adjusted R$^{2}$](https://en.wikipedia.org/wiki/Coefficient_of_determination#Adjusted_R2) seeks to take into account the number of predictors in the model, to reduce the incentive to simply add more features in the hope of lucking into a better score. In ML/NLP, relatively little attention is paid to model complexity in this sense. The attitude is like: if you can improve your model by adding features, you might as well do that!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Pearson correlation\n",
    "\n",
    "The [Pearson correlation coefficient $\\rho$](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) between two vectors $y$ and $\\widehat{y}$ of dimension $N$ is:\n",
    "\n",
    "$$\n",
    "\\textbf{pearsonr}(y, \\widehat{y}) = \n",
    "\\frac{\n",
    "  \\sum_{i}^{N} (y_{i} - \\mu_{y}) \\cdot (\\widehat{y}_{i} - \\mu_{\\widehat{y}})\n",
    "}{\n",
    "  \\sqrt{\\sum_{i}^{N} (y_{i} - \\mu_{y})^{2}} \\cdot \\sqrt{\\sum_{i}^{N} (\\widehat{y}_{i} - \\mu_{\\widehat{y}})^{2}}\n",
    "}\n",
    "$$\n",
    "where $\\mu_{y}$ is the mean of $y$ and $\\mu_{\\widehat{y}}$ is the mean of $\\widehat{y}$.\n",
    "\n",
    "This is implemented as `scipy.stats.pearsonr`, which returns the coefficient and a p-value."
   ]
  },
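  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For readers who prefer code to formulas, here is a minimal pure-Python sketch of the standard Pearson coefficient, in which the denominator is the square root of the product of the two sums of squared deviations. The function name `pearson` and the toy vectors are assumptions for this example; `scipy.stats.pearsonr` is the implementation to use in practice.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "\n",
    "def pearson(y, y_hat):\n",
    "    mu_y = sum(y) / len(y)\n",
    "    mu_hat = sum(y_hat) / len(y_hat)\n",
    "    # Numerator: how the two vectors vary together around their means.\n",
    "    covariation = sum((a - mu_y) * (b - mu_hat) for a, b in zip(y, y_hat))\n",
    "    # Denominator: the product of the individual variations, square-rooted.\n",
    "    ss_y = sum((a - mu_y) ** 2 for a in y)\n",
    "    ss_hat = sum((b - mu_hat) ** 2 for b in y_hat)\n",
    "    return covariation / math.sqrt(ss_y * ss_hat)\n",
    "\n",
    "\n",
    "y = [1.0, 2.0, 3.0, 4.0]\n",
    "print(pearson(y, [2.0, 4.0, 6.0, 8.0]))  # perfectly increasing line\n",
    "print(pearson(y, [8.0, 6.0, 4.0, 2.0]))  # perfectly decreasing line\n",
    "print(pearson(y, [1.0, 3.0, 2.0, 4.0]))  # two middle values swapped\n",
    "```\n",
    "\n",
    "Note that the sketch is undefined (division by zero) when either vector is constant, which matches the fact that correlation is not meaningful in that case."
   ]
  },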
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Bounds of Pearson correlations\n",
    "\n",
    "$[-1, 1]$, where $-1$ is a complete negative linear correlation, $+1$ is a complete positive linear correlation, and $0$ is no linear correlation at all."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Weaknesses of Pearson correlation\n",
    "\n",
    "Pearson correlations are highly sensitive to the magnitude of the differences between the gold and predicted values. As a result, they are also very sensitive to outliers. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Related to Pearson correlation\n",
    "\n",
    "* For comparing gold values $y$ and predicted values $\\widehat{y}$, Pearson correlation is equivalent to a linear regression using $\\widehat{y}$ and a bias term to predict $y$. [See this great blog post for details.](https://lindeloev.github.io/tests-as-linear/)\n",
    "\n",
    "* [As noted above](#Related-to-R-squared-scores), there is also a close relationship to R-squared values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Spearman rank correlation\n",
    "\n",
    "The Spearman rank correlation coefficient between two vectors $y$ and $\\widehat{y}$ of dimension $N$ is the Pearson coefficient with all of the data mapped to their ranks.\n",
    "\n",
    "It is implemented as `scipy.stats.spearmanr`, which returns the coefficient and a p-value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [],
   "source": [
    "corr_df = pd.DataFrame({\n",
    "    'y1': np.random.uniform(-10, 10, size=1000),\n",
    "    'y2': np.random.uniform(-10, 10, size=1000)})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "SpearmanrResult(correlation=-0.014107322107322106, pvalue=0.6559028954995918)"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "scipy.stats.spearmanr(corr_df['y1'], corr_df['y2'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(-0.014107322107322122, 0.6559028954996294)"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "scipy.stats.pearsonr(corr_df['y1'].rank(), corr_df['y2'].rank())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
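   },
   "source": [
    "#### Bounds of Spearman rank correlations\n",
    "\n",
    "$[-1, 1]$, where $-1$ is a perfectly decreasing monotonic relationship (a complete negative linear correlation between the ranks), $+1$ is a perfectly increasing monotonic relationship (a complete positive linear correlation between the ranks), and $0$ is no linear correlation between the ranks at all."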
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Weaknesses of Spearman rank correlation\n",
    "\n",
    "Unlike Pearson, Spearman is not sensitive to the magnitude of the differences. In fact, it's invariant under all monotonic rescaling, since the values are converted to ranks. This also makes it less sensitive to outliers than Pearson.\n",
    "\n",
    "Of course, these strengths become weaknesses in domains where the raw differences do matter. That said, in most NLU contexts, Spearman will be a good conservative choice for system assessment."
   ]
  },
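  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The invariance under monotonic rescaling can be checked with a small pure-Python sketch. The helpers `pearson`, `ranks`, and `spearman` and the toy data are assumptions for this example, and `ranks` ignores ties for simplicity (unlike `scipy.stats.spearmanr`, which handles them).\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "\n",
    "def pearson(a, b):\n",
    "    mu_a, mu_b = sum(a) / len(a), sum(b) / len(b)\n",
    "    num = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b))\n",
    "    den = math.sqrt(sum((x - mu_a) ** 2 for x in a) *\n",
    "                    sum((y - mu_b) ** 2 for y in b))\n",
    "    return num / den\n",
    "\n",
    "\n",
    "def ranks(values):\n",
    "    # Rank 1 goes to the smallest value; assumes there are no ties.\n",
    "    order = sorted(range(len(values)), key=lambda i: values[i])\n",
    "    result = [0] * len(values)\n",
    "    for rank, index in enumerate(order, start=1):\n",
    "        result[index] = rank\n",
    "    return result\n",
    "\n",
    "\n",
    "def spearman(a, b):\n",
    "    return pearson(ranks(a), ranks(b))\n",
    "\n",
    "\n",
    "gold = [1, 2, 3, 4, 5]\n",
    "# Predictions that preserve the ordering but distort the spacing.\n",
    "pred = [1, 10, 100, 1000, 10000]\n",
    "\n",
    "print(pearson(gold, pred))\n",
    "print(spearman(gold, pred))\n",
    "```\n",
    "\n",
    "Spearman is at its maximum because the ordering is perfect, whereas Pearson is pulled well below 1 by the single very large value."
   ]
  },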
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Related to Spearman rank correlation\n",
    "\n",
    "For comparing gold values $y$ and predicted values $\\widehat{y}$, Spearman correlation is equivalent to a linear regression using $\\textbf{rank}(\\widehat{y})$ and a bias term to predict $\\textbf{rank}(y)$. [See this great blog post for details.](https://lindeloev.github.io/tests-as-linear/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Sequence prediction\n",
    "\n",
    "Sequence prediction metrics all seek to summarize and quantify the extent to which a model has managed to reproduce, or accurately match, some gold standard sequences. Such problems arise throughout NLP. Examples: \n",
    "\n",
    "1. Mapping speech signals to their desired transcriptions.\n",
    "1. Mapping texts in a language $L_{1}$ to their translations in a distinct language or dialect $L_{2}$.\n",
    "1. Mapping input dialogue acts to their desired responses.\n",
    "1. Mapping a sentence to one of its paraphrases.\n",
    "1. Mapping real-world scenes or contexts (non-linguistic) to descriptions of them (linguistic)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Evaluation is very challenging because the relationships tend to be __one-to-many__ (one input, many acceptable outputs): a given sentence might have multiple suitable translations; a given dialogue act will always have numerous felicitous responses; any scene can be described in multiple ways; and so forth. The most constrained of these problems is the speech-to-text case in 1, but even that one has indeterminacy in real-world contexts (humans often disagree about how to transcribe spoken language)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Word error rate\n",
    "\n",
    "The [word error rate](https://en.wikipedia.org/wiki/Word_error_rate) (WER) metric is a word-level, length-normalized measure of [Levenshtein string-edit distance](https://en.wikipedia.org/wiki/Levenshtein_distance):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [],
   "source": [
    "def wer(seq_true, seq_pred):\n",
    "    d = edit_distance(seq_true, seq_pred)\n",
    "    return d / len(seq_true)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.3333333333333333"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wer(['A', 'B', 'C'], ['A', 'A', 'C'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.25"
      ]
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wer(['A', 'B', 'C', 'D'], ['A', 'A', 'C', 'D'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "To calculate this over the entire test set, one sums the edit distances for all gold–predicted pairs and divides by the total length of all the gold examples, rather than normalizing each case separately:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [],
   "source": [
    "def corpus_wer(y_true, y_pred):\n",
    "    dists = [edit_distance(seq_true, seq_pred)\n",
    "             for seq_true, seq_pred in zip(y_true, y_pred)]\n",
    "    lengths = [len(seq) for seq in y_true]\n",
    "    return sum(dists) / sum(lengths)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This gives a single summary value for the entire set of errors."
   ]
  },
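  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The choice of normalization matters. The sketch below contrasts the corpus-level calculation with the mean of per-sequence scores on a hypothetical two-example test set. The `levenshtein` helper is a textbook dynamic-programming implementation written for this example so that it runs without extra dependencies; it is a stand-in for the `edit_distance` function used above, not a copy of it.\n",
    "\n",
    "```python\n",
    "def levenshtein(a, b):\n",
    "    # Dynamic program over insertions, deletions, and substitutions.\n",
    "    prev = list(range(len(b) + 1))\n",
    "    for i, x in enumerate(a, start=1):\n",
    "        curr = [i]\n",
    "        for j, y in enumerate(b, start=1):\n",
    "            cost = 0 if x == y else 1\n",
    "            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))\n",
    "        prev = curr\n",
    "    return prev[-1]\n",
    "\n",
    "\n",
    "# One short sequence with an error and one long sequence without errors.\n",
    "y_true = [['A', 'B'], ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']]\n",
    "y_pred = [['A', 'C'], ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']]\n",
    "\n",
    "dists = [levenshtein(t, p) for t, p in zip(y_true, y_pred)]\n",
    "corpus_level = sum(dists) / sum(len(t) for t in y_true)\n",
    "mean_of_sequences = sum(d / len(t) for d, t in zip(dists, y_true)) / len(y_true)\n",
    "print(corpus_level, mean_of_sequences)\n",
    "```\n",
    "\n",
    "The corpus-level score treats every gold token equally, so the single error counts for one token in ten. Averaging per-sequence scores instead gives the short sequence as much weight as the long one, which more than doubles the reported error rate here."
   ]
  },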
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Bounds of word error rate\n",
    "\n",
    "$[0, \\infty)$, where 0 is best. (The lack of a finite upper bound derives from the fact that the normalizing constant is given by the true sequences, and the predicted sequences can differ from them in any conceivable way in principle.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Value encoded by word error rate\n",
    "\n",
    "This method says that our desired notion of closeness or accuracy can be operationalized in terms of the low-level operations of insertion, deletion, and substitution. The guiding intuition is very much like that of F scores."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Weaknesses of word error rate\n",
    "\n",
    "The value encoded reveals a potential weakness in certain domains. Roughly, the more __semantic__ the task, the less appropriate WER is likely to be. \n",
    "\n",
    "For example, adding a negation to a sentence will radically change its meaning but incur only a small WER penalty, whereas passivizing a sentence (_Kim won the race_ &rarr; _The race was won by Kim_) will hardly change its meaning at all but incur a large WER penalty. \n",
    "\n",
    "See also [Liu et al. 2016](https://www.aclweb.org/anthology/D16-1230) for similar arguments in the context of dialogue generation."
   ]
  },
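  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The negation and passivization contrast can be checked directly. In this sketch, the sentences are lowercased and split on whitespace, the negated variant uses _never_ (one inserted word), and `levenshtein` is a textbook implementation written for this example rather than the `edit_distance` function used above; all of these are assumptions.\n",
    "\n",
    "```python\n",
    "def levenshtein(a, b):\n",
    "    # Dynamic program over insertions, deletions, and substitutions.\n",
    "    prev = list(range(len(b) + 1))\n",
    "    for i, x in enumerate(a, start=1):\n",
    "        curr = [i]\n",
    "        for j, y in enumerate(b, start=1):\n",
    "            cost = 0 if x == y else 1\n",
    "            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))\n",
    "        prev = curr\n",
    "    return prev[-1]\n",
    "\n",
    "\n",
    "def wer_sketch(seq_true, seq_pred):\n",
    "    return levenshtein(seq_true, seq_pred) / len(seq_true)\n",
    "\n",
    "\n",
    "gold = 'kim won the race'.split()\n",
    "negated = 'kim never won the race'.split()\n",
    "passive = 'the race was won by kim'.split()\n",
    "\n",
    "wer_negated = wer_sketch(gold, negated)\n",
    "wer_passive = wer_sketch(gold, passive)\n",
    "print(wer_negated, wer_passive)\n",
    "```\n",
    "\n",
    "The meaning-reversing negation costs a single insertion, whereas the meaning-preserving passive needs five edits against a four-word gold sequence, which pushes its WER above 1."
   ]
  },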
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Related to word error rate\n",
    "\n",
    "* WER can be thought of as a family of different metrics varying in the notion of edit distance that they employ.\n",
    "\n",
    "* The Word Accuracy Rate is 1.0 minus the WER, which, despite its name, is intuitively more like [recall](#Recall) than [accuracy](#Accuracy)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### BLEU scores\n",
    "\n",
    "BLEU (Bilingual Evaluation Understudy) scores were originally developed in the context of machine translation, but they are applied in other generation tasks as well. For BLEU scoring, we require a set of gold outputs. The metric has two main components:\n",
    "\n",
    "* __Modified n-gram precision__: A direct application of precision would divide the number of correct n-grams in the predicted output (n-grams that appear in any translation) by the number of n-grams in the predicted output. This has a degenerate solution in which the predicted output contains only one word (perhaps repeated many times). BLEU's modified version clips the count of each n-gram in the predicted output at the maximum number of times that n-gram appears in any single translation.\n",
    "\n",
    "* __Brevity penalty (BP)__: to avoid favoring outputs that are too short, a penalty is applied. Let $Y$ be the set of gold outputs, $\\widehat{y}$ the predicted output, $c$ the length of the predicted output, and $r$ the length of the gold output in $Y$ whose length is closest to $c$ (that is, the one with the smallest absolute length difference from $c$). Then:\n",
    "\n",
    "$$\\textbf{BP}(Y, \\widehat{y}) =\n",
    "\\begin{cases}\n",
    "1 & \\textrm{ if } c > r \\\\\n",
    "\\exp(1 - \\frac{r}{c}) & \\textrm{otherwise}\n",
    "\\end{cases}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "The BLEU score itself is typically a combination of modified n-gram precision for various $n$ (usually up to 4):\n",
    "\n",
    "$$\\textbf{BLEU}(Y, \\widehat{y}) = \\textbf{BP}(Y, \\widehat{y}) \\cdot \n",
    "    \\exp\\left(\\sum_{n=1}^{N} w_{n} \\cdot \\log\\left(\\textbf{modified-precision}(Y, \\widehat{y}, n)\\right)\\right)$$\n",
    "\n",
    "where $Y$ is the set of gold outputs, $\\widehat{y}$ is the predicted output, and $w_{n}$ is a weight for each $n$-gram level (usually set to $1/N$).\n",
    "\n",
    "NLTK has [implementations of BLEU scoring](http://www.nltk.org/_modules/nltk/translate/bleu_score.html) for the sentence-level, as defined above, and for the corpus level (`nltk.translate.bleu_score.corpus_bleu`). At the corpus level, it is typical to do a kind of [micro-averaging](#Micro-averaged-F-scores) of the modified precision scores and use a cumulative version of the brevity penalty."
   ]
  },
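  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "To make the combination concrete, here is a small worked case with made-up values: take $N = 2$, uniform weights $w_{1} = w_{2} = 1/2$, modified precisions of $0.8$ for unigrams and $0.5$ for bigrams, and a brevity penalty of 1. Then\n",
    "\n",
    "$$\\textbf{BLEU}(Y, \\widehat{y}) = 1 \\cdot \\exp\\left(\\tfrac{1}{2}\\log 0.8 + \\tfrac{1}{2}\\log 0.5\\right) = \\sqrt{0.8 \\cdot 0.5} \\approx 0.632$$\n",
    "\n",
    "With uniform weights, the exponentiated sum is just the geometric mean of the modified precisions, so a modified precision of 0 at any level drives the whole score to 0."
   ]
  },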
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Bounds of BLEU scores\n",
    "\n",
    "[0, 1], with 1 being the best, though with no expectation that any system will achieve 1, since even sets of human-created translations do not reach this level."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Value encoded by BLEU scores\n",
    "\n",
    "BLEU scores attempt to achieve the same balance between precision and recall that runs through the majority of the metrics discussed here. It has many affinities with [word error rate](#Word-error-rate), but seeks to accommodate the fact that there are typically multiple suitable outputs for a given input."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Weaknesses of BLEU scores\n",
    "\n",
    "* [Callison-Burch et al. (2006)](http://www.aclweb.org/anthology/E06-1032) criticize BLEU as a machine translation metric on the grounds that it fails to correlate with human scoring of translations. They highlight its insensitivity  to n-gram order and its insensitivity to n-gram types (e.g., function vs. content words) as causes of this lack of correlation.\n",
    "\n",
    "* [Liu et al. (2016)](https://www.aclweb.org/anthology/D16-1230) specifically argue against BLEU as a metric for assessing dialogue systems, based on a lack of correlation with human judgments about dialogue coherence."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Related to BLEU scores\n",
    "\n",
    "There are many competitors/alternatives to BLEU, most proposed in the context of machine translation. Examples: [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric), [METEOR](https://en.wikipedia.org/wiki/METEOR), [HyTER](http://www.aclweb.org/anthology/N12-1017), [Orange (smoothed Bleu)](http://www.aclweb.org/anthology/C04-1072)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Perplexity\n",
    "\n",
    "[Perplexity](https://en.wikipedia.org/wiki/Perplexity) is a common metric for directly assessing generation models by calculating the probability that they assign to sequences in the test data. It is based in a measure of average surprisal:\n",
    "\n",
    "$$H(P, x) = -\\frac{1}{m}\\log_{2} P(x)$$\n",
    "\n",
    "where $P$ is a model assigning probabilities to sequences and $x$ is a sequence.\n",
    "\n",
    "Perplexity is then the exponent of this:\n",
    "\n",
    "$$\\textbf{perplexity}(P, x) = 2^{H(P, x)}$$\n",
    "\n",
    "Using any base $n$ both in defining $H$ and as the base in $\\textbf{perplexity}$ will lead to identical results.\n",
    "\n",
    "Minimizing perplexity is equivalent to maximizing probability."
   ]
  },
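  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "As a small sanity check on these definitions, assume $m$ is the number of tokens in $x$ and consider a hypothetical model that assigns probability $1/4$ to each token of a 4-token sequence, so that $P(x) = (1/4)^{4} = 1/256$. Then\n",
    "\n",
    "$$H(P, x) = -\\frac{1}{4}\\log_{2}\\frac{1}{256} = \\frac{8}{4} = 2 \\qquad \\textbf{perplexity}(P, x) = 2^{2} = 4$$\n",
    "\n",
    "The model is thus as uncertain as if it were choosing uniformly among 4 options at each step. Switching to natural logarithms gives $H(P, x) = \\log 4$ and $e^{\\log 4} = 4$, the same value."
   ]
  },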
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "It is common to report per-token perplexity; here the averaging should be done in log-space to deliver a [geometric mean](https://en.wikipedia.org/wiki/Geometric_mean):\n",
    "\n",
    "$$\\textbf{token-perplexity}(P, x) = \\exp\\left(\\frac{\\log\\textbf{perplexity}(P, x)}{\\textbf{length}(x)}\\right)$$\n",
    "\n",
    "When averaging perplexity values obtained from all the sequences in a text corpus, one should again use the geometric mean:\n",
    "\n",
    "$$\\textbf{mean-perplexity}(P, X) = \n",
    "\\exp\\left(\\frac{1}{m}\\sum_{x\\in X}\\log(\\textbf{token-perplexity}(P, x))\\right)$$\n",
    "\n",
    "for a set of $m$ examples $X$."
   ]
  },
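  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "To see why the geometric mean matters, take two hypothetical sequences with token-level perplexities of 2 and 8 (values chosen only for illustration). Then\n",
    "\n",
    "$$\\textbf{mean-perplexity}(P, X) = \\exp\\left(\\frac{\\log 2 + \\log 8}{2}\\right) = \\sqrt{16} = 4$$\n",
    "\n",
    "whereas the arithmetic mean would be $5$. Averaging in log-space keeps a single sequence with very high perplexity from dominating the result."
   ]
  },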
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Bounds of perplexity\n",
    "\n",
    "[1, $\\infty$], where 1 is best."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Values encoded by perplexity\n",
    "\n",
    "The guiding idea behind perplexity is that a good model will assign high probability to the sequences in the test data. This is an intuitive, expedient intrinsic evaluation, and it matches well with the objective for models trained with a cross-entropy or logistic objective."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Weaknesses of perplexity\n",
    "\n",
    "* Perplexity is heavily dependent on the nature of the underlying vocabulary in the following sense: one can artificially lower one's perplexity by having a lot of `UNK` tokens in the training and test sets. Consider the extreme case in which _everything_ is mapped to `UNK` and perplexity is thus perfect on any test set. The more worrisome thing is that any amount of `UNK` usage side-steps the pervasive challenge of dealing with infrequent words.\n",
    "\n",
    "* [As Hal Daumé discusses in this post](https://nlpers.blogspot.com/2014/05/perplexity-versus-error-rate-for.html), the perplexity metric imposes an artificial constrain that one's model outputs are probabilistic."
   ]
  },
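  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "The extreme `UNK` case can be spelled out as a sketch, assuming a vocabulary that contains only `UNK`, ignoring any end-of-sequence symbol, and reading $m$ as the number of tokens in $x$. The model can then assign probability 1 to every token, so $P(x) = 1$ for any test sequence $x$ and\n",
    "\n",
    "$$H(P, x) = -\\frac{1}{m}\\log_{2} 1 = 0 \\qquad \\textbf{perplexity}(P, x) = 2^{0} = 1$$\n",
    "\n",
    "which is the best possible value even though the model has learned nothing about the language."
   ]
  },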
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Related to perplexity\n",
    "\n",
    "Perplexity is the inverse of probability and, [with some assumptions](http://www.cs.cmu.edu/~roni/11761/PreviousYearsHandouts/gauntlet.pdf), can be seen as an approximation of the cross-entropy between the model's predictions and the true underlying sequence probabilities."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Other resources\n",
    "\n",
    "The scikit-learn [model evaluation usage guide](http://scikit-learn.org/stable/modules/model_evaluation.html) is a great resource for metrics I didn't cover here. In particular:\n",
    "\n",
    "* Clustering\n",
    "\n",
    "* Ranking\n",
    "\n",
    "* Inter-annotator agreement"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
